Abstract

When training an artificial neural network (ANN) for classification using backpropagation
of error, the weights are usually updated by minimizing the sum-squared error on
the training set. As training proceeds, overtraining may be observed as the network begins
to memorize the training data. This occurs because, as the magnitude of the weight vector,
||w||, grows, the decision boundaries become overly complex in much the same way as
a  too-high  order  polynomial  approximation  can  overfit  a  data  set  in  a  regression  problem. 
Since  ||w||  grows  during  standard  backpropagation,  it  is  important  to  initialize  the  weights 
with  consideration  to  the  importance  of  the  weight  vector  magnitude,  ||w||.  With  this  in 
mind,  the  expected  value  of  the  magnitude  of  the  initial  weight  vector  is  here  derived  for  the 
separate  cases  of  each  weight  drawn  from  a  normal  or  uniform  distribution.  The  usefulness  of 
this  derivation  is  universal  since  the  magnitude  of  the  weight  vector  plays  such  an  important 
role  in  the  formation  of  the  classification  boundaries.  When  the  network  overtrains  on  the 
training  data,  it  will  not  exhibit  consistently  low  error  on  subsequent  test  data.  One  way  to 
overcome  this  overtraining  problem  is  to  stop  the  training  early,  which  limits  the  magnitude 
of  the  weight  vector  below  what  it  would  be  if  the  training  were  allowed  to  continue  until 
a  near-global  training  error  minimum  were  found.  The  question  then  is  when  to  stop  the 
training.  Here,  the  relationship  between  training  data  set  size  and  the  magnitude  of  the 
weight  vector  providing  good  generalization  results  is  empirically  established  using  cross- 
validational  analysis  on  small  subsets  of  the  training  data.  These  results  are  then  used  to 
estimate  at  what  weight  vector  magnitude  the  training  should  be  stopped  when  using  the 
full data set. An ANN trained in this manner is shown to increase the percentage of
correctly classified test data points by an average of 1.5% over that of one trained
using true cross-validational early stopping on a smaller data set. The
technique  of  hyperspherical  backpropagation,  which  entails  training  at  a  set  weight  vector 
magnitude,  is  also  introduced  and  shown  to  be  useful  in  lowering  the  validation  error  during 
training. 




Table of Contents

List of Figures
Abstract

I. Introduction
    1.1 Problem Statement
    1.2 Scope
    1.3 Contributions
    1.4 Organization

II. Background
    2.1 Pattern Classification
        2.1.1 ANNs for Classification
    2.2 Weight Initialization
    2.3 Searching for the Minimum Error
        2.3.1 Genetic Approaches
    2.4 Generalization
        2.4.1 Radial Complexity
        2.4.2 Regularization
        2.4.3 Bayesian Analysis for Classification
        2.4.4 Cross-Validation
        2.4.5 Magnitude of the Weight Vector
    2.5 Summary

III. Achieving Good Generalization
    3.1 Initial Radial Complexity
        3.1.1 Weights Distributed Normally
        3.1.2 Weights Distributed Uniformly
    3.2 Consistent Behavior of Radial Complexity During Training
        3.2.1 Consistency of Training Behavior When Using EP Training
        3.2.2 Consistency of Training Behavior When Training on TESSA Data Set
    3.3 Cross-Validational Radial Complexity Estimation
        3.3.1 Generalization Error
        3.3.2 Early Stopping at the Estimated Radial Complexity
    3.4 Training at a Fixed Radial Complexity
        3.4.1 Genetic Approach
        3.4.2 Hyperspherical Backpropagation Approach
    3.5 Summary

IV. Conclusions and Recommendations
    4.1 Conclusions
    4.2 Recommendations for Future Research

Appendix A. Weight Update Formula
    A.1 Standard Batch Backpropagation
        A.1.1 Second Layer Weight Update
        A.1.2 First Layer Weight Update
    A.2 Weight Updates with Regularization

Appendix B. Hyperspherical Coordinate Transformation

Appendix C. Derivation of Angle Updates for Hyperspherical Backpropagation
    C.1 Second Layer Angle Update
    C.2 First Layer Angle Update

Bibliography


List of Figures

1.1. Hand-written Characters
1.2. Infrared Image
2.1. Activation Function
2.2. ANN Example
2.3. Genetic Algorithm
2.4. Sequential Squashing
2.5. Sequential Squashing Demo
2.6. Sigmoid Behavior
2.7. Discriminant Boundaries
2.8. Regularization Effect
2.9. Bayesian Training
2.10. Cross-Validation Example
3.1. Decay of the variance of the radial complexity with increasing W
3.2. Growth of the expected value of the radial complexity squared and the square of the expected value of the radial complexity with increasing W
3.3. Growth of the approximate expected value of the radial complexity and the average observed magnitude of the radial complexity with increasing W
3.4. SSE and Standard Backprop
3.5. SSE and Bayesian Backprop
3.6. Softmax error and Standard Backprop
3.7. Softmax error and Bayesian Backprop
3.8. EP for ANN Training
3.9. IR training
3.10. Cross-Validation result for 5 training data points from each class
3.11. Cross-Validation result for 50 training data points from each class
3.12. Estimation of radial complexity
3.13. Cross-Validation result for 100 training data points from each class
3.14. Overtraining on the TESSA data
3.15. Using an EP to train the ANN at a specific radial complexity
3.16. Standard backpropagation versus hyperspherical backpropagation on OCR data


Radial  Complexity  Estimation 
for  Improved  Generalization 
in  Artificial  Neural  Networks 


I.  Introduction 

1.1  Problem  Statement 

Artificial Neural Networks (ANNs) are approximation methods for establishing a relationship
between an input vector, x, and an output vector, y. The information about
this  relationship  is  stored  in  a  set  of  scalars  called  weights.  In  a  feed-forward  ANN,  this 
relationship is usually approximated by “training” the weights with a set of training vectors.
The two most important aspects of training an ANN are the convergence speed and
the ability to generalize well [69]. A significant amount of effort has gone into speeding up
the  training  time  of  an  ANN  [15,36,37,48,49,66,71,75,78,79].  Of  the  two,  though,  the 
generalization  ability  of  the  network  will  determine  its  applicability  to  a  given  problem  after 
training  [7,59];  an  ANN  which  does  not  generalize  well  will  quickly  start  to  “gather  dust” 
since it does not perform consistently for new input data. Good generalization capability
can be achieved in a number of ways, including early stopping [7], pruning/growing of
the hidden layer nodes [51], cross-validational early stopping [59], and regularization [8,56]
(including Bayesian methods [10,41]). These methods attempt to limit the effective complexity
of the network, the effective complexity being the ability of the network to capture
the  underlying  structure  of  the  training  data.  If  the  effective  complexity  is  too  low,  the 
network  cannot  model  the  underlying  structure  of  the  training  data  well  so  the  error  will  be 
consistently  high  on  the  training  data  and  any  test  data.  If  the  effective  complexity  is  too 
high,  the  network  begins  to  memorize  the  training  data,  including  any  noise  in  regression 
problems  and  outliers  in  classification  problems.  This  yields  a  low  error  on  the  training  data, 
but  will  tend  to  yield  inconsistent  error  on  subsequent  test  data.  In  neither  case  is  good 
generalization observed, since the network is either insufficiently complex (too general) or
too  complex  to  be  general  enough  to  give  consistent,  low-error  classification  results  on  future 
data. 

The  pursuit  of  good  generalization  should  determine  the  network  architecture  and  the 
magnitudes  of  the  scalar  weights.  The  complexity  of  an  ANN  necessary  to  achieve  good 
generalization  for  a  given  problem  is  determined  in  part  by  the  size  of  the  training  set  [7,83]. 
This  means  that  if  we  split  our  data  into  separate  sets  for  training,  cross-validation,  and 
testing,  we  are  limiting  ourselves  to  an  effective  complexity  that  is  smaller  than  one  that 
would be allowed if we used all available data for training [59]. There are three factors which
constrain the complexity of the discriminant boundaries when using an ANN as a pattern
classifier: the number of sigmoid building blocks (S1), the magnitude of the weight vector
(ρ), and the training data (amount available and intrinsic complexity of the distribution).
These three factors work in conjunction with each other, so, for example, if there are N
training examples, there will be some optimal S1 that will prevent overtraining regardless
of the magnitude of the weight vector, ρ. This is known as structural stabilization. Or, for
a given S1 and N, we can limit ρ such that overtraining will not occur. Ideally, though, we
want to constrain the complexity using the training data to as great a degree as possible, since
this is the best information available about the distribution of the inputs, which determines
the classification boundaries in a Bayes optimal classifier [7,16]. The more training data we
have, the closer we can build our discriminant functions to the Bayes optimal discriminant
functions. The problem, then, is to use as much training data as possible to train the ANN
so that we approximate the Bayes optimal discriminant function as closely as possible.
Unfortunately, when the training set size is finite, the training data alone frequently will not
provide adequate constraint of the complexity of the discriminant boundaries to prevent
overtraining, and the resulting network does not perform optimally when classifying new
data. In this case, we need to limit the number of hidden nodes, S1, or the magnitude
of the weight vector, ρ. Bartlett has argued that limiting ρ is more important than limiting S1,
since a larger number of sigmoid building blocks (quantified by the number of hidden nodes,
S1) can provide a closer approximation to the Bayes optimal discriminant boundaries [4].
In the past, the ANN training method of early stopping based on the cross-validation error
has  proved  somewhat  successful,  but  has  not  taken  advantage  of  the  full  training  data  set 
so  as  to  best  approximate  the  Bayes  optimal  discriminant  boundaries  [16];  while  methods  of 
ANN  training  that  use  the  full  data  set  have  not  necessarily  provided  good  generalization 
capabilities  upon  completion  of  training  [7]. 

1.2  Scope 

This research is limited to feed-forward single-hidden-layer ANNs. Batch backpropagation
is used since this method is theoretically guaranteed to converge to a solution [7,59].
The data sets used include a set of hand-written numerical characters from 0 to 9 (OCR data
set), an example of which is shown in Figure 1.1, as well as a set of infrared image data
(TESSA data set), an example of which is shown in Figure 1.2. These two data sets are
representative of the types of data ANNs are used to analyze in the real world, with the
TESSA data set being particularly difficult to classify.

Figure 1.1 Example of hand-written characters, or the OCR data set.

Figure 1.2 Example of an infrared image from the TESSA data set.

1.3  Contributions 

Several  contributions  are  presented  to  the  field  of  Artificial  Neural  Networks.  First,  the 
expected  initial  “radial  complexity,”  defined  as  the  magnitude  of  the  weight  vector,  is  derived 
for  the  case  of  the  individual  weights  being  initialized  by  drawing  random  variables  from 
uniform  and  normal  distributions.  Work  in  the  past  has  concentrated  primarily  on  initializing 
the  weights  so  as  to  decrease  training  time,  while  the  consideration  here  is  for  improving  the 
generalization  ability.  Second,  the  radial  complexity  is  shown  experimentally  to  behave 
consistently  during  training  from  run  to  run.  This  behavior  justifies  the  cross-validational 
early  stopping  procedures  used  here  and  in  previous  work.  Third,  the  procedure  of  “radial 
complexity  estimation”  which  allows  the  weights  to  be  trained  based  on  the  magnitude  of 
the  weight  vector  is  developed.  Using  this  estimated  radial  complexity  is  shown  to  lead  to 
improvements in classification ability on data sets which are prone to overtraining. Finally,
the method of “hyperspherical backpropagation” is developed and shown to lead to lower
error on the validation set during training.

1.4  Organization

Chapter  1,  Introduction,  explains  the  problem  to  be  solved,  defines  and  limits  the 
scope  of  the  research,  and  presents  the  contributions  to  the  field.  Chapter  2,  Background, 
reviews  the  current  literature  on  methods  of  achieving  good  generalization  in  multi-layer 
feed-forward  ANNs,  including  discussions  on  the  effect  that  the  radial  complexity  has  on 
the  decision  boundaries.  Chapter  3,  Achieving  Good  Generalization,  derives  the  expected 
value  of  the  radial  complexity  during  weight  initialization  and  demonstrates  the  feasibility 
of using pseudo-cross-validational early stopping based on an estimated radial complexity to
achieve good generalization in a multi-layer feed-forward ANN using data sets which exemplify
real-world classification problems. Also, the subject of hyperspherical backpropagation
is  developed  and  shown  to  improve  the  validation  error  during  training.  Finally,  Chapter 
4,  Conclusions  and  Recommendations,  summarizes  where  this  research  puts  the  field  and 
where  further  research  is  indicated.  The  following  chapter  describes  previous  research  in  the 
area  of  using  ANNs  for  pattern  recognition. 

II.  Background

Man is constantly trying to teach machines to ease his workload. Some types of pattern
recognition are considered quite overwhelming or tedious for a human: analyzing large numbers
of mammograms for possible cancer [86], determining the identity of a suspect by
comparing fingerprints with those on file in a huge database [25], or recognizing a particular
phoneme in a set of speech signals [80]. In this chapter, the performance of ANNs as pattern
recognizers  is  discussed,  as  are  the  steps  necessary  to  assure  that  trained  ANNs  perform 
well  when  making  decisions  after  training.  The  ideal  pattern  classifier  is  the  Bayes  optimal 
classifier  since  it  provides  the  minimum  probability  of  misclassification. 

2.1  Pattern  Classification 

The  best  error  rate  one  can  hope  to  consistently  achieve  in  any  classification  problem  is 
the  Bayes  error  rate.  This  is  the  error  achieved  when  using  a  Bayes  optimal  classifier,  which 
uses  the  distribution  of  the  inputs  to  make  classification  decisions  [7, 16]  and  minimizes  the 
probability  of  misclassification  by  using  the  posterior  probability 

\[
P(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, P(C_k)}{p(\mathbf{x})}
\]

to  make  classification  decisions.  Unfortunately,  the  true  statistical  properties  of  the  input 
data are seldom known, so various methods are employed to mimic the Bayes optimal classifier.
After training an ANN as a pattern classifier, the best generalization results are attained
if  it  forms  Bayes  optimal  decision  boundaries. 
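
As a small illustration of the posterior rule above, the following sketch classifies a scalar input when the class-conditional densities are known exactly. The two unit-variance Gaussian classes, the equal priors, and the function names are all assumptions made for this example; in practice these densities are seldom available, which is the point of the surrounding text.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posteriors(x, class_params, priors):
    """Return P(C_k | x) for every class using Bayes' rule."""
    joint = [gaussian_pdf(x, m, s) * p for (m, s), p in zip(class_params, priors)]
    evidence = sum(joint)  # p(x), the normalizing denominator
    return [j / evidence for j in joint]

def bayes_decision(x, class_params, priors):
    """Index of the class with the largest posterior probability."""
    post = posteriors(x, class_params, priors)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical two-class problem (not from the thesis data sets).
params = [(-1.0, 1.0), (1.0, 1.0)]  # (mean, std) for C_1 and C_2
priors = [0.5, 0.5]
post = posteriors(0.5, params, priors)
```

With equal priors and equal variances the decision boundary of this toy problem lies midway between the two means, so inputs to the right of zero are assigned to the second class.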

2.1.1  ANNs  for  Classification.  Historically,  maximum  likelihood  estimation  (MLE) 
techniques,  such  as  backpropagation  of  error,  have  been  very  popular  for  training  neural 
networks.  Several  sources  [7,  59, 60]  give  excellent  treatments  on  these  methods.  Ruck 
demonstrated that when an ANN is trained as a pattern classifier using sum-squared error
(SSE), upon completion of training it provides a good approximation to a Bayes optimal
classifier [61]. SSE is the most widely used criterion for evaluating the error of a network,
but other error functions can be used (e.g., Minkowski error), and, in fact, for ANNs used
for  classification,  the  softmax  error  is  more  appropriate  [7]  for  approximating  the  posterior 
probability  of  an  input  belonging  to  a  specific  class.  The  outputs  of  the  network  can  be 
interpreted  as  probabilities  of  class  membership  if  we  structure  our  network  using  logistic 
sigmoid  activation  functions  (see  Figure  2.1)  for  the  hidden  layer  and  softmax  activation 
functions (a generalization of the logistic sigmoid activation function) for the output layer
[7,9,59]. Using the softmax function allows us to interpret the outputs of the network as
probabilities of class membership by forcing the values at the output layer of the network to
lie in the range (0,1) and sum to 1.

Figure 2.1 Demonstration of how the activation function, $g(a_k)$, is related to the output
of a given node.

The softmax function is defined as

\[
y_k = g(a_k) = \frac{\exp(a_k)}{\sum_{k'} \exp(a_{k'})}.
\]

The summation over k' is over all the outputs and acts as the normalization factor.
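
A minimal sketch of the softmax activation follows. Subtracting the largest activation before exponentiating is a common numerical safeguard added here as an assumption; it cancels in the ratio, so the outputs are unchanged.

```python
import math

def softmax(activations):
    """Map output-layer activations a_k to y_k = exp(a_k) / sum_k' exp(a_k')."""
    shift = max(activations)  # guards against overflow; cancels in the ratio
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)  # the normalization factor over all outputs
    return [e / total for e in exps]

# Hypothetical activations for a three-output network.
y = softmax([2.0, 1.0, 0.1])
```

Each output lies strictly between 0 and 1 and the outputs sum to 1, which is what permits the interpretation as class membership probabilities.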


Consider  a  classification  problem  on  the  well-known  IRIS  data  set.  This  data  contains 
three  classes  of  flowers  and  each  data  vector  consists  of  four  features.  The  database  has  150 
input  feature  vectors.  An  example  ANN  architecture  used  to  classify  this  data  is  shown  in 
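Figure 2.2. This ANN has two layers and five hidden nodes. The weights feeding into the
hidden layer are denoted as $W1_{ji}$, the biases feeding into the hidden layer are denoted as
$B1_j$, the weights feeding into the output layer are denoted as $W2_{kj}$, and the biases feeding
into the output layer are denoted as $B2_k$.

Figure 2.2 Artificial Neural Network used in our example.

With R input features, S1 hidden nodes, and K outputs, the weight matrices and bias vectors
take the form

\[
W1 = \begin{bmatrix}
W1_{1,1} & W1_{1,2} & \cdots & W1_{1,R} \\
W1_{2,1} & W1_{2,2} & \cdots & W1_{2,R} \\
\vdots & \vdots & \ddots & \vdots \\
W1_{S1,1} & W1_{S1,2} & \cdots & W1_{S1,R}
\end{bmatrix},
\qquad
B1 = \begin{bmatrix} B1_1 \\ B1_2 \\ \vdots \\ B1_{S1} \end{bmatrix},
\]

\[
W2 = \begin{bmatrix}
W2_{1,1} & W2_{1,2} & \cdots & W2_{1,S1} \\
W2_{2,1} & W2_{2,2} & \cdots & W2_{2,S1} \\
\vdots & \vdots & \ddots & \vdots \\
W2_{K,1} & W2_{K,2} & \cdots & W2_{K,S1}
\end{bmatrix},
\qquad
B2 = \begin{bmatrix} B2_1 \\ B2_2 \\ \vdots \\ B2_K \end{bmatrix}.
\]
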
With four input features, five hidden nodes, and three classes, we have 4 × 5 + 5 × 3 = 35
weights as well as 5 + 3 = 8 biases, for a weight space of dimension 43. The final weight
vector,  after  training,  would  ideally  give  an  output  for  a  set  of  training  data  that  had  low 
error  between  the  target  vectors  and  the  output  vectors,  while  also  yielding  low  error  on  the 
test  data.  Here,  the  training  data  set  is  denoted  by 


\[
D = \{(\mathbf{x}^{(1)}, \mathbf{t}^{(1)}), \ldots, (\mathbf{x}^{(N)}, \mathbf{t}^{(N)})\}, \tag{2.2}
\]

with $\mathbf{x}^{(n)}$ being the nth training vector and $\mathbf{t}^{(n)}$ the nth target vector that represents the class
membership of the nth training vector. A typical normalized feature vector is


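\[
\mathbf{x}^{(1)} = \begin{bmatrix} -0.8977 \\ +1.0286 \\ -1.3368 \\ -1.3086 \end{bmatrix}.
\]

This feature vector belongs to class 1, so

\[
\mathbf{t}^{(1)} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}.
\]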


This network has three outputs: output one is the probability that the input, $\mathbf{x}^{(n)}$, is a
member of class one, $C_1$, given the training data, D, and outputs two and three are the
probabilities of belonging to class two and three, respectively, such that

\[
\begin{aligned}
y_1^{(n)} &= P(\mathbf{x}^{(n)} \in C_1 \mid D), \\
y_2^{(n)} &= P(\mathbf{x}^{(n)} \in C_2 \mid D), \\
y_3^{(n)} &= P(\mathbf{x}^{(n)} \in C_3 \mid D).
\end{aligned}
\]
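
The example network can be sketched in a few lines to make the parameter count and the probabilistic outputs concrete. This is an illustrative forward pass only, assuming logistic hidden units, softmax outputs, and weights drawn uniformly from (-a, a); the seed, the value of a, and all function names are hypothetical, so the numerical outputs will not match those reported in the text.

```python
import math
import random

R, S1, K = 4, 5, 3  # input features, hidden nodes, classes (IRIS example)

def init_params(a=0.5, seed=0):
    """Draw every weight and bias uniformly from (-a, a); a and seed are assumptions."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-a, a) for _ in range(R)] for _ in range(S1)]
    B1 = [rng.uniform(-a, a) for _ in range(S1)]
    W2 = [[rng.uniform(-a, a) for _ in range(S1)] for _ in range(K)]
    B2 = [rng.uniform(-a, a) for _ in range(K)]
    return W1, B1, W2, B2

def count_params(params):
    """Dimension of the weight space: all weights plus all biases."""
    W1, B1, W2, B2 = params
    return sum(map(len, W1)) + len(B1) + sum(map(len, W2)) + len(B2)

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def softmax(acts):
    shift = max(acts)  # numerical safeguard; does not change the result
    exps = [math.exp(a - shift) for a in acts]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, params):
    """Logistic hidden layer followed by a softmax output layer."""
    W1, B1, W2, B2 = params
    hidden = [logistic(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, B1)]
    acts = [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, B2)]
    return softmax(acts)

params = init_params()
x1 = [-0.8977, 1.0286, -1.3368, -1.3086]  # the class-one feature vector quoted above
y = forward(x1, params)
```

Here `count_params(params)` returns 43, matching the dimension of the weight space, and the three entries of `y` sum to 1.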

When using the softmax function at the outputs, the error function used for classification
takes the form

\[
E = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_k^{(n)} \ln y_k^{(n)}, \tag{2.3}
\]

which is based on Bishop’s cross-entropy for multiple, mutually exclusive classes [7]. Here $t_k^{(n)}$ is
the target value at output k for input vector n, $y_k^{(n)}$ is the actual value at output k
for input vector n, and K denotes the number of disjoint classes, $\{C_1, C_2, \ldots, C_K\}$. For
completeness, we derive the weight update procedure for this error function in Appendix A.

Upon first initializing the weights, the output, $\mathbf{y}^{(1)}$, for the input vector shown above is

\[
\mathbf{y}^{(1)} = \begin{bmatrix} 0.3443 \\ 0.3178 \\ 0.3379 \end{bmatrix}.
\]

This nearly equal-probability state makes sense since our weights are still in a random state and
no training has happened yet. After just five training epochs (weight updates) through the
training data set, D, the output vector for our class one training input vector is

\[
\begin{bmatrix} 0.9634 \\ 0.0000 \\ 0.0366 \end{bmatrix},
\]

which is much closer to the desired output of

\[
\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}.
\]

We interpret this output to mean that the probability of the input vector belonging to class 1 is
approximately 0.96, or

\[
P(\mathbf{x}^{(n)} \in C_1 \mid D) \approx 0.96.
\]
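
The effect of those five epochs on the error can be checked directly. The sketch below evaluates a cross-entropy error of the form used in equation (2.3) on the two output vectors quoted above for the class-one example; the clipping constant `eps` is an assumption added only to avoid taking the logarithm of zero.

```python
import math

def cross_entropy(targets, outputs, eps=1e-12):
    """E = -sum_n sum_k t_k^(n) ln y_k^(n) for 1-of-K target vectors."""
    total = 0.0
    for t, y in zip(targets, outputs):
        for tk, yk in zip(t, y):
            if tk > 0.0:  # terms with a zero target contribute nothing
                total -= tk * math.log(max(yk, eps))
    return total

target = [[1, 0, 0]]  # class-one target vector
e_init = cross_entropy(target, [[0.3443, 0.3178, 0.3379]])     # at initialization
e_trained = cross_entropy(target, [[0.9634, 0.0000, 0.0366]])  # after five epochs
```

For this single training vector only the class-one output enters the sum, so the error falls from roughly 1.07 to roughly 0.04 as that output rises toward 1.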


The  training  algorithm  pushes  the  weight  vector  into  the  direction  necessary  to  lower 
the  error  on  the  training  set,  but  what  then  should  be  the  starting  point  of  this  weight 
vector? 


2.2  Weight  Initialization 

The initial weight vector is an often overlooked part of the training
process [29]. In the past, for standard backpropagation, the approach to choosing a starting
weight set has been to choose initial weights from a uniform distribution between plus and
minus a, usually with a = 0.5 [29]. Kolen argues that the magnitude of the weight vector at
initialization plays a key role in the convergence speed of the backpropagation algorithm [29].
Other methods of weight initialization have also focused primarily on speeding up the weight
training process [48,64], although Denoeux uses prototypes for weight initialization and
considers the repercussions that this initialization can have on generalization [13]. Bayesian
backpropagation relies on a “prior” probability distribution of the weights, which is usually
chosen to be a normal distribution with a parameter, α, governing the variance of that
distribution [7,10,41,59]. The value of α is chosen based on our prior belief about how close each
weight is to zero. Although the point of regularization is to create a better generalized ANN,
no direct relationship has previously been established between α and the ability of
the network to generalize well.
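
To give a feel for how the initialization range sets the starting magnitude of the weight vector, the following sketch estimates the mean initial magnitude by simulation for weights drawn uniformly from (-a, a). It is a rough numerical illustration and not the derivation given in Chapter 3. The only analytic fact used is that a uniform variable on (-a, a) has second moment a²/3, so the expected squared magnitude of a W-dimensional weight vector is W a²/3; the trial count, seed, and function names are assumptions.

```python
import math
import random

def weight_norm(weights):
    """Euclidean magnitude ||w|| of a weight vector."""
    return math.sqrt(sum(w * w for w in weights))

def mean_initial_norm(num_weights, draw, trials=2000, seed=0):
    """Monte Carlo estimate of E[||w||] when each weight is drawn by draw(rng)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += weight_norm([draw(rng) for _ in range(num_weights)])
    return total / trials

W = 43   # dimension of the IRIS example's weight space
a = 0.5  # uniform initialization on (-a, a)
uniform_mean = mean_initial_norm(W, lambda rng: rng.uniform(-a, a))
# Square root of E[||w||^2] = W * a^2 / 3; by Jensen's inequality this
# is an upper bound on E[||w||], and a close one when W is large.
uniform_rms = math.sqrt(W * a * a / 3.0)
```

For W = 43 and a = 0.5 the root-mean-square value is about 1.89, and the simulated mean magnitude lands close to it.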

After  initialization,  the  weights  are  usually  trained  using  backpropagation  of  error,  but 
does  the  resultant  set  of  weights  provide  the  lowest  possible  error  on  the  training  set? 

2.3  Searching  for  the  Minimum  Error 

With MLE techniques, such as backpropagation of error, the error is computed as a
function  of  the  weights  and  some  gradient  descent  technique  is  used  to  find  a  local  minimum. 
Though  this  takes  account  of  only  one  of  many  possible  minima  in  weight  space,  the  results 
are  frequently  satisfactory  enough  to  justify  our  limited  search.  If  the  network  does  not 
converge  to  an  adequate  solution,  the  standard  procedure  is  to  restart  the  algorithm  with  a 
new  random  set  of  weights  [12]  to  find  a  more  suitable  solution. 
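
The restart procedure can be sketched on a toy problem. The one-dimensional error surface below is hypothetical, chosen only because it has two minima of different depth; the learning rate, step count, and number of restarts are likewise assumptions. Each restart descends from a new random starting weight, and the lowest-error result is kept.

```python
import random

def error(w):
    """Hypothetical 1-D error surface with two minima of different depth."""
    return (w * w - 1.0) ** 2 + 0.3 * w

def gradient(w):
    """Derivative of error(w) with respect to w."""
    return 4.0 * w * (w * w - 1.0) + 0.3

def gradient_descent(w, rate=0.01, steps=2000):
    """Plain gradient descent; converges to whichever minimum is downhill."""
    for _ in range(steps):
        w -= rate * gradient(w)
    return w

def train_with_restarts(restarts=10, seed=0):
    """Restart from new random weights and keep the lowest-error solution."""
    rng = random.Random(seed)
    best_w, best_e = None, float("inf")
    for _ in range(restarts):
        w = gradient_descent(rng.uniform(-2.0, 2.0))
        if error(w) < best_e:
            best_w, best_e = w, error(w)
    return best_w, best_e

best_w, best_e = train_with_restarts()
```

A single descent started on the right-hand side of this surface settles in the shallower minimum near w = 1; the restarts are what allow the deeper minimum near w = -1 to be found.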

Fogel [18] points out that one popularly accepted disadvantage of many MLE techniques
(such as gradient descent along the error surface) is the propensity for the weight
vector to become trapped in a local error minimum that is unsuitable. Although these
methods converge to a solution quickly, once the weight vector is perceived to be trapped in an
unacceptable local minimum, the algorithm is usually reinitialized with a random weight vector and the
training restarted [12]. Much effort has gone into finding the global error minimum when
training the weights [34,82], but Lawrence has recently cast aspersions upon this procedure [33].
He points out that the minimum error found by gradient descent methods when
training an ANN is often significantly worse than the global minimum error. This makes
sense when considering the generalization ability of an ANN; while low error is obviously a desirable
characteristic, the weight vector yielding the lowest error obtainable on the training
set will almost never be the weight vector yielding the lowest error on the test set. This
subject will be further expanded in Section 2.4. Other methods exist for training ANNs, one
such being based on evolution in living organisms.

2.3.1  Genetic  Approaches.  Another  method  of  determining  the  weights  in  an  ANN 
is  by  letting  the  weights  evolve  over  time  in  such  a  way  as  to  mimic  the  evolution  of  an 
organism. These genetic approaches have become more popular for searching out local error
minima in ANNs [27,34,47,53,67,69,81,84].

2.3.1.1 Genetic Algorithms. Genetic Algorithms (GAs) have been used to
determine  weights  in  neural  networks  with  varied  success  [28,30,65].  GAs  are  loosely  based 
on  models  of  genetic  change,  or  evolution,  in  populations  of  individual  organisms  [22].  Each 
organism  (weight  vector)  is  defined  as  a  chromosome,  which  in  turn  is  made  up  of  some 
pre-determined  number  of  genes  (bits  representing  weights).  These  genes  are  often  treated 
as  binary  values,  so  a  typical  chromosome  would  be  represented  by  a  string  of  genes,  or 
vector, such as (100110000111010...11001)^T. The fitness of each of these organisms can be
measured,  and  possible  goals  include  invoking  an  evolutionary  process  to  either  improve  the 
overall  fitness  of  the  population  or  to  obtain  a  highly  fit  single  member.  This  idea  of  fitness 
governs  the  extent  to  which  an  individual  organism  can  influence  future  generations,  and 
genetic  operators  have  been  developed  to  propagate  this  influence.  Crossover  and  mutation 
are  the  operators  most  often  used,  where  crossover  is  the  swapping  between  two  chromosomes 
of  some  subset  of  their  genes,  and  mutation  is  the  bit-switch  of  randomly  selected  genes  in 
a  chromosome.  Crossover  allows  organisms  to  evolve  much  more  rapidly  than  they  would  if 
each  offspring  simply  contained  a  copy  of  the  parent  chromosome,  occasionally  modified  by 
a  mutation,  and  corresponds  to  a  large  step  size  in  our  weight  space.  Mutation,  on  the  other 
hand,  offers  the  opportunity  for  new  genetic  material  to  be  introduced  into  the  population, 
producing  a  more  robust  search  of  the  entire  solution  space,  and  mutation  corresponds  to  a 
small step size in our search over weight space. A typical GA search is described below and
illustrated  in  Figure  2.3. 


2.3.1.2 Typical GA Search.


1.  Randomly  generate  an  initial  population  of  chromosomes. 

2.  Test the fitness of each chromosome and save the single chromosome which is most fit
in the “most-fit” queue.

3.  Generate a new population from the old population using the fitness of the members of the old
population and a “roulette wheel” random sorting, with greater fitness increasing the
probability of being picked.

4.  Perform crossover and mutation over the entire new population.


Each  chromosome  has  a  random  chance  for  crossover.  When  tagged,  a  chromosome 
will  interchange  some  subset  of  its  genes  with  the  same  subset  of  genes  in  another  tagged 
chromosome,  for  example 


(2.4) 


Each  gene  has  a  random  chance  for  mutation.  When  tagged,  a  bit  switch  occurs 
(i.e., 1 → 0 and 0 → 1).

5.  Test the fitness of each chromosome in the new population and save the single fittest
chromosome  if  it  is  more  fit  than  the  chromosome  currently  in  the  most-fit  queue. 

6.  The new population takes the place of the old population, and the search loops back to step 3.

7.  Continue until some termination criterion is met (high fitness, number of generations,
etc.).

8.  The chromosome in the most-fit queue is the solution.


Figure  2.3  Typical  Genetic  Algorithm  Search  [23]. 
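
The steps of Figure 2.3 can be sketched as a short program. This is a minimal illustration under stated assumptions, not an ANN trainer: the fitness function simply counts 1 bits as a stand-in for low network error, crossover swaps the genes after a single random cut point, and the population size, rates, and generation limit are hypothetical.

```python
import random

def fitness(chrom):
    """Hypothetical fitness: number of 1 bits (stand-in for low network error)."""
    return sum(chrom)

def roulette(pop, fits, rng):
    """Step 3: sample a new population with probability proportional to fitness."""
    weights = [f + 1e-9 for f in fits]  # small offset keeps the total positive
    return [list(c) for c in rng.choices(pop, weights=weights, k=len(pop))]

def crossover(pop, rate, rng):
    """Step 4: tagged chromosomes swap the genes after a random cut point."""
    tagged = [i for i in range(len(pop)) if rng.random() < rate]
    for i, j in zip(tagged[0::2], tagged[1::2]):
        cut = rng.randrange(1, len(pop[i]))
        pop[i][cut:], pop[j][cut:] = pop[j][cut:], pop[i][cut:]

def mutate(pop, rate, rng):
    """Step 4: each tagged gene is bit-switched (1 -> 0, 0 -> 1)."""
    for chrom in pop:
        for g in range(len(chrom)):
            if rng.random() < rate:
                chrom[g] = 1 - chrom[g]

def ga_search(pop_size=30, genes=20, generations=60, pc=0.6, pm=0.01, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(genes)] for _ in range(pop_size)]  # step 1
    most_fit = list(max(pop, key=fitness))                                      # step 2
    initial_best = fitness(most_fit)
    for _ in range(generations):                              # step 7: fixed budget
        pop = roulette(pop, [fitness(c) for c in pop], rng)   # step 3
        crossover(pop, pc, rng)                               # step 4
        mutate(pop, pm, rng)
        challenger = max(pop, key=fitness)                    # step 5
        if fitness(challenger) > fitness(most_fit):
            most_fit = list(challenger)
    return most_fit, initial_best                             # step 8

best, initial_best = ga_search()
```

Because the most-fit queue is only overwritten by a fitter chromosome, the returned solution is never worse than the best member of the initial population.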


Korning  points  out  that  training  weights  in  an  ANN  requires  very  large  chromosomes 
to  ensure  sufficient  dynamic  range  for  the  magnitudes  of  the  weights  [30].  Angeline  [3] 
gives good reasons to believe that GAs are less than optimal for training neural networks,
especially  when  compared  with  the  more  general  Evolution  Program. 

2.3.1.3 Evolution Programs. Evolution Programs (EPs) are a generalization
of  Genetic  Algorithms.  While  Genetic  Algorithms  are  generally  considered  to  be  limited  to 
binary  representations  of  the  data,  Michalewicz  points  out  that  EPs  use  whatever  form  is 
most  useful  (usually  a  form  closely  related  to  the  actual  data)  to  solve  a  given  problem  [43]. 
A  typical  chromosome  may  then  be  simply  a  vector  of  real  numbers.  With  EPs,  we  are 
not  limited  to  or  bound  by  the  crossover  and  mutation  operators  typically  used  in  GAs. 
If  mutation  and  crossover  are  used,  and  if  real  numbers  replace  binary,  possible  solutions 
pointed  out  by  Michalewicz  are: 

1.  For  crossover,  simply  swap  some  subset  of  genes  (now  real  numbers  instead  of  bits)  as 
in  GA  crossover. 

2.  For  mutation,  replace  the  tagged  gene  with  a  new  real  number  generated  by  some 
probability  distribution. 

EPs  have  been  used  to  find  the  weights  in  ANNs  [3,57,68].  Fogel  argues  that  mutation 
is  the  dominant  operator  in  Evolutionary  Programming  [18],  and  Angeline  points  out  that 
crossover  may  be  particularly  inappropriate  when  training  weights  for  a  neural  network  [3]. 
Porto  [57]  uses  only  a  mutation  operator  and  a  fitness  function  based  on  the  sum-squared 
error.  He  perturbs  chromosomes  with  a  normal  random  variable  whose  variance  is  equal  to 
that  error,  at  which  point  new  and  old  chromosomes  compete  to  find  a  new  population  that 
has  an  overall  lower  error.  EPs  and  GAs,  though,  converge  to  single  solutions;  when  more 
than  one  solution  is  desired,  we  need  an  algorithm  that  can  find  multiple  solutions. 
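
A mutation-only EP of the kind attributed to Porto above can be sketched as follows. This is an illustration under stated assumptions, not Porto's code: the name `evolve`, the population size, the initial range of the weights, and the rule of keeping the `pop_size` lowest-error members of the combined old and new populations are all choices made for the example. The `error` argument is assumed to be a non-negative sum-squared error.

```python
import math
import random

def evolve(error, dim, pop_size=20, generations=200, seed=0):
    """Mutation-only EP on real-valued chromosomes (illustrative parameters)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        offspring = []
        for parent in pop:
            # Perturb with a normal random variable whose variance is the error.
            sigma = math.sqrt(error(parent))
            offspring.append([w + rng.gauss(0.0, sigma) for w in parent])
        # Old and new chromosomes compete; the lowest-error members survive.
        pop = sorted(pop + offspring, key=error)[:pop_size]
        history.append(error(pop[0]))
    return pop[0], history
```

Because parents remain in the competition, the best error in `history` can never increase from one generation to the next, and the mutation step shrinks automatically as the error falls.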

2.3.1.4 Sequential Niche Techniques. Evolution Programs are fundamentally
tools to solve maximization problems, finding a local maximum of a function using
appropriate genetic operators. Beasley [5] developed a method that allows conventional EP
techniques to be used to find an arbitrary number of local maxima of a function with several
local maxima. This method is outlined in Figure 2.4 and summarized as follows [5]:

Figure 2.4 Illustration of Sequential Squashing of the Fitness Function [5].




Sequential  Niche  Algorithm. 


1.  Find  an  initial  local  maximum  using  any  standard  EP  technique. 

2.  Modify  the  fitness  function  in  the  region  surrounding  the  initial  local  maximum  using 
a  “squashing”  function  to  eliminate  it  from  subsequent  searches. 

3.  Search  for  as  many  additional  local  maxima  as  desired,  re-modifying  the  fitness  function 
after  each  solution  is  obtained  to  eliminate  that  solution  from  the  search.  The  locations 
of  all  local  maxima  are  stored  in  memory  and  the  fitness  function  is  modified  to  squash 
all  local  maxima  found  in  previous  searches. 

Beasley  discusses  a  number  of  squashing  functions,  one  of  which  is 


\[
G(r) =
\begin{cases}
m^{\,1 - d_{xs}/r} & \text{if } d_{xs} < r, \\
1 & \text{otherwise,}
\end{cases}
\tag{2.5}
\]

where $m > 0$ is the desired multiplicative factor at the center of the solution on
subsequent fitness tests, $r$ is the user-defined radius about the solution that will be affected
by $G$, and $d_{xs}$ is the actual distance of the chromosome from the solution mode.

Figure  2.5  demonstrates  the  application  of  Equation  (2.5).  Here,  we  have  a  four  maxima 
fitness  function  in  two-dimensional  space.  As  the  search  locates  each  maximum,  the  fitness 
function  is  modified  to  eliminate  that  maximum  from  the  search  until  all  four  maxima  of  the 
function  are  found. 
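
The squashing step can be sketched as below, assuming the derating factor of Equation (2.5) has the exponential form $m^{1 - d_{xs}/r}$ inside the radius and equals 1 outside it. The function names `derating` and `modified_fitness` are hypothetical, and Euclidean distance between chromosomes is an assumption of the example.

```python
import math

def derating(distance, radius, m):
    """Squashing factor: m at the solution, rising to 1 at the niche radius."""
    if distance < radius:
        return m ** (1.0 - distance / radius)
    return 1.0

def modified_fitness(raw_fitness, found, radius, m):
    """Return a fitness function squashed around every solution found so far."""
    def fitness(x):
        value = raw_fitness(x)
        for s in found:
            # Each stored maximum contributes its own multiplicative factor.
            value *= derating(math.dist(x, s), radius, m)
        return value
    return fitness
```

After each EP run, the newly located maximum is appended to `found` and the next run searches the function returned by `modified_fitness`.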

Figure 2.5 Plot (a) shows the unaltered, initial fitness function. Plot (b) shows the altered
fitness function after one of the maxima has been found by the EP and
subsequently squashed. Plots (c) and (d) show finding and squashing of the
remaining maxima and how this affects the fitness function each time.

Having discussed some of the most widely accepted methods for training ANNs, we
need to recall that the primary concern of any ANN design should be that the network
generalizes well when tested on previously unseen data.


2.4 Generalization

An ANN’s usefulness after training is completely determined by its ability to correctly
analyze future data generated by the same process that generated the training data [1,
11,31,33,35,44,50,54,58,62,76,85]. A number of researchers have bounded the expected
generalization  error  [21,42],  and  some  attempt  to  estimate  the  generalization  ability  of  an 
ANN  after  training  [52,63,73,74].  It  is  well  known  that  reducing  the  complexity  of  the 
network  leads  to  better  generalization  [14].  Reducing  the  complexity  entails  limiting  the 
architecture  or  limiting  the  value  of  the  individual  weights  [19,24].  Efforts  to  achieve  good 
generalization  by  limiting  the  number  of  free  parameters,  or  structural  stabilization,  have 
been  carried  out  using  growing  and/or  pruning  of  weights  and  hidden  nodes  [14,51]. 

There are a number of ways to limit the value of the individual weights so as to improve
generalization. Regularization appends a function of the weights to the error function
so as to penalize large weight vector magnitudes during training, while cross-validational
early stopping bases the stopping criterion on the validation set classification error rather
than on the training set classification error, thereby stopping training while the weight
vector’s magnitude is smaller than it would be if only the training set error were taken into
consideration [7,59]. Some research investigates adding noise to the training data so as to
“smear out” the data and make overtraining less likely [26,72], but Bishop indicates that
this method is closely related to limiting the magnitude of the weights in much the same
way as regularization [7]. Another form of limiting the value of the weights is soft weight
sharing. Once again, though, Bishop points out that this is another form of regularization.

The  magnitude  of  the  weight  vector  is  central  to  the  issue  of  good  generalization  and  is 
here  referred  to  as  the  “radial  complexity.”  The  next  section  gives  an  example  showing  how 
changing  the  magnitude  of  the  weight  vector  affects  the  decision  boundaries  for  classification. 

2.4.1  Radial  Complexity.  The  fact  that  the  magnitude  of  the  weight  vector  is 
important  for  good  generalization  has  been  used  in  regularization  by  many  researchers  [7, 
59],  and  Bartlett  has  pointed  out  that  the  size  of  the  individual  weights  (which  is  directly 
correlated  with  the  magnitude)  is  more  important  than  the  number  of  weights  in  problems 
where ANNs are useful [4]. Dunne analyzed the evolution of the weights in a simple
two-feature, two-weight problem as the polar coordinate characteristics of the weights were
varied [17]. To understand how the magnitude of the weight vector affects the classification
ability  of  the  ANN  trained  as  a  pattern  classifier,  we  need  to  understand  how  an  ANN 
constructs  discriminant  boundaries  from  summations  of  weighted  sigmoids.  There  are  many 
factors  which  affect  how  the  decision  boundary  is  constructed.  First,  let  us  look  at  the 
equation  for  the  output  of  a  hidden  node, 


\[
z_j = g(w_{j1} x_1 + w_{j2} x_2 + \cdots + w_{jR} x_R + b_j), \tag{2.6}
\]

where $w_{ji}$ is the weight from input $x_i$ to hidden node $j$, $b_j$ is the bias feeding into node $j$,
and $g(\cdot)$ is the logistic sigmoid output activation function

\[
g(a) = \frac{1}{1 + e^{-a}}. \tag{2.7}
\]

Another sigmoid activation function often used is the tanh activation function,

\[
g(a) = \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}, \tag{2.8}
\]

but it is related to the logistic sigmoid through a linear transformation (which can be done by
the weights and biases), so the choice of which to use is irrelevant at the hidden layer [7].
Each  hidden  node  contributes  a  sigmoid  which  is  the  basic  building  block  of  the  discriminant 
boundaries. 
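
The linear relationship between the two activation functions is $\tanh(a) = 2g(2a) - 1$, where $g$ is the logistic sigmoid. A short numerical check, with function names chosen only for this example, is:

```python
import math

def logistic(a):
    """Logistic sigmoid, 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def tanh_from_logistic(a):
    """tanh written as a scaled and shifted logistic sigmoid."""
    return 2.0 * logistic(2.0 * a) - 1.0
```

The factor of 2 on the input can be absorbed by the weights and bias feeding the node, and the output scaling and shift can be absorbed by the weights and bias of the following layer.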

The  logistic  sigmoid  is  a  non-decreasing  function  and  is  approximately  linear  in  the 
region  around  a  =  0.  If  we  consider  this  region  then  we  see  that  the  output  of  the  hidden 
node  (as  a  function  of  the  input)  lies  on  a  hyperplane  whose  slope  is  determined  by  the 
weights  [17]  and  whose  position  is  determined  by  the  bias  term.  Where  this  hyperplane 
crosses  through  g(0)  forms  a  preliminary  decision  boundary.  Therefore,  we  can  express  this 
individual  decision  boundary  as 


\[
w_{j1} x_1 + w_{j2} x_2 + \cdots + w_{jR} x_R + b_j = 0. \tag{2.9}
\]

For simplicity, let us consider the two-feature case. The extension into multiple dimensions
is straightforward. Now

\[
w_{j1} x_1 + w_{j2} x_2 + b_j = 0. \tag{2.10}
\]

So, if $w_{j2} \neq 0$, then

\[
x_2 = -\frac{w_{j1} x_1 + b_j}{w_{j2}} = -\frac{w_{j1}}{w_{j2}} x_1 - \frac{b_j}{w_{j2}}. \tag{2.11}
\]


Figure 2.6 Demonstration of how the slope of the linear region of the sigmoid function
(the region around $g(0)$) changes with changing weight magnitude. The steeper
slope corresponds to a larger weight magnitude while a shallower slope corresponds
to a lower weight magnitude.

From Equation (2.11), we can see that the slope and intercept of the decision boundary
are functions of the ratio of weights. Changing the magnitude of a vector containing these
three weights has no effect on the location of this individual sigmoid in input space, but
does change the slope of the hyperplane approximation in the linear region of the logistic
sigmoid. Figure 2.6 demonstrates how changing a weight’s magnitude changes the slope of
the sigmoid in the linear region. Keep in mind, though, that a given output, $y_k$, is determined
by a function that is the shifted weighted sum of all the sigmoids from all the hidden nodes
so that

\[
y_k = g(w_{k,1} z_1 + w_{k,2} z_2 + \cdots + w_{k,S_1} z_{S_1} + b_k), \tag{2.12}
\]

where $w_{kj}$ is the weight going from hidden node $j$ to output node $k$, and $b_k$ is the bias
feeding into node $k$. The form of the activation function at each output node is important
in  interpreting  each  output’s  meaning  for  training,  but  is  not  the  primary  factor  affecting 
the  decision  boundaries  in  the  input  space.  Each  weight  going  into  the  output  layer  again 
alters  the  slope  of  the  approximation  hyperplane  used  to  build  each  individual  decision 
boundary.  The  output  bias  translates  all  the  decision  boundaries,  but  affects  all  individual 
sigmoids  used  to  create  the  decision  boundaries  equally.  The  primary  factor,  then,  in  the 
determination  of  the  decision  regions  is  the  ratio  of  weights  and  biases,  not  the  actual  values. 
The importance of the actual values of the weights is seen when looking at the summation
of  the  sigmoids.  Where  the  sigmoids  intersect  forms  a  “corner,”  and  the  “sharpness”  of  this 
corner  depends  on  the  steepness  of  the  two  sigmoids;  two  steep  sigmoids  intersecting  will 
form  a  sharp  corner,  while  two  shallow  sigmoids  meeting  will  have  a  more  rounded  corner. 
Remember,  the  position  of  the  sigmoid  decision  boundary  is  a  function  of  the  ratio  of  weights, 
while  the  steepness  of  the  sigmoid  is  a  function  of  the  magnitude  of  weights.  In  order  for  a 
network  to  overtrain,  it  needs  to  “reach  out  and  grab”  outlying  data  points  that  lie  within 
what  should  statistically  be  another  class’s  decision  space.  To  construct  a  decision  region 
such  as  this,  narrow  and  long,  would  require  sharp  corners.  By  limiting  the  magnitude  of 
the weight vector, we have limited the ability of the ANN to construct sharp corners and
therefore limited its ability to overtrain. Figure 2.7 shows the effect that changing the
magnitude  of  the  weight  vector  has  on  the  decision  boundaries.  This  figure  shows  contour 
plots  of  a  triangular  decision  region  formed  from  three  sigmoids  in  two-dimensional  space. 
Notice  that  decreasing  the  radial  complexity  has  a  “low-pass”  filtering  effect  on  the  high 
frequency  corners. 
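
The distinction between the ratio and the magnitude of the weights can be checked numerically for a single two-input hidden node. The sketch below uses hypothetical helper names; it relies only on Equation (2.11) and on the fact that the logistic sigmoid has derivative 1/4 at $a = 0$.

```python
import math

def boundary(w1, w2, b):
    """Slope and intercept of the line w1*x1 + w2*x2 + b = 0, solved for x2."""
    return -w1 / w2, -b / w2

def steepness_at_boundary(w1, w2):
    """Gradient magnitude of g(w1*x1 + w2*x2 + b) on the boundary, where g'(0) = 1/4."""
    return 0.25 * math.hypot(w1, w2)

def scale(weights, c):
    """Multiply every weight and the bias by the same factor c."""
    return tuple(c * w for w in weights)
```

Scaling `(w1, w2, b)` by a common factor leaves the output of `boundary` unchanged, while `steepness_at_boundary` grows in proportion to the factor, which is the sharpening effect described above.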

Figure 2.7 Example showing how changing the magnitude of the weight vector changes the
ability of the ANN to make decisions based on discriminant boundaries. The
magnitude of the weight vector decreases as we move from top left, plot (a), to
bottom right, plot (f). Notice that the position of the discriminant boundaries
remains unchanged, but the sharp corners are rounded off with a decrease in
magnitude.

ANN training that concentrates on structural stabilization (limiting the number of
weights) attempts to limit complexity by limiting the number of sigmoids used to build
the discriminant functions, which we see now is the functional equivalent of limiting the
magnitude of the weight vector since a smaller magnitude weight vector will require more
sigmoids to accomplish the same type of discrimination accomplished by fewer weights with
unlimited magnitude. One way to keep the magnitude of the weight vector small is to use
regularization.

2.4.2 Regularization. Regularization is a way to keep the magnitude of the weight
vector relatively small so as to minimize overtraining in a network and achieve good
generalization [2,7,55,56,59]. The error function using regularization is the sum of the log
likelihood error, $E_D$, and the regularization error, $E_W$,

\begin{align}
S(\mathbf{w}) &= E_D + E_W \tag{2.13} \\
&= -\sum_{n=1}^{N} \sum_{k=1}^{c} t_k^{(n)} \ln y_k^{(n)} + \frac{\alpha}{2} \|\mathbf{w}\|^2, \tag{2.14}
\end{align}

where

\[
\|\mathbf{w}\|^2 = \sum_{i=1}^{W} w_i^2 \tag{2.15}
\]

and $\alpha > 0$ is the regularization coefficient. Regularization effectively warps the error surface
by  adding  it  to  a  hyper-parabola  centered  at  the  origin,  thus  favoring  weight  vectors  closer 
to  zero.  Figure  2.8  demonstrates  the  effect  of  regularization  graphically.  The  weight  update 
repercussions due to regularization are discussed in Appendix A.2. Ripley [59] feels that
regularization  of  some  sort  should  always  be  used  when  training  an  ANN.  Notice  that  the 
effect  of  regularization  is  to  decrease  the  effective  complexity  by  decreasing  the  magnitude 
of  the  weight  vector  as  the  network  is  being  trained. 
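
A minimal sketch of the regularized error follows, assuming a cross-entropy data term and a penalty of the form $(\alpha/2)\|\mathbf{w}\|^2$. The function names are hypothetical, and the network outputs are assumed to be strictly positive (as they are with softmax outputs).

```python
import math

def data_error(targets, outputs):
    """Log likelihood error: minus the sum of t * ln(y) over patterns and outputs."""
    return -sum(t * math.log(y)
                for ts, ys in zip(targets, outputs)
                for t, y in zip(ts, ys) if t > 0)

def weight_penalty(w, alpha):
    """Regularization error: (alpha / 2) times the squared weight magnitude."""
    return 0.5 * alpha * sum(wi * wi for wi in w)

def total_error(targets, outputs, w, alpha):
    """Total error: data term plus weight penalty."""
    return data_error(targets, outputs) + weight_penalty(w, alpha)

def penalty_gradient(w, alpha):
    """Gradient of the penalty, alpha * w, which pulls every weight toward zero."""
    return [alpha * wi for wi in w]
```

The gradient of the penalty is proportional to the weight vector itself, so each update step includes a pull toward the origin whose strength is set by `alpha`.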


The regularization coefficient used in regularization determines the amount that the
weight vector’s magnitude is penalized, and there is a direct relationship between this
coefficient and the ability of the final network design to generalize well [7]. Larsen attempted to
find the regularization coefficient that yielded the optimal generalization performance [32],
while Bayesian backpropagation, discussed in the next section, allows this parameter to be
changed while the network is being trained [7,10,41].

Figure 2.8 Effect of using the magnitude squared of the weight vector for regularization.
Notice that the new error function is just the old error function warped to lie
on the surface of a quadratic bowl.




Some forms of regularization attempt the use of the full data set as the training set, but
standard regularization relies on the regularization coefficient, $\alpha$, to determine the penalty
placed on the magnitude of the weight vector. The coefficient $\alpha$ can be chosen by using
cross-validation to determine at what $\alpha$ overtraining occurs. This again, though, limits the size of the
training data set used to determine $\alpha$, thereby imposing too stringent a penalty on the weight
vector magnitude, $\rho = \|\mathbf{w}\|$, and
overconstraining the complexity of the discriminant boundaries that would be allowed if $\alpha$
could be determined using all the data.

Bayesian  backpropagation  is  a  form  of  regularization  that  attempts  to  overcome  this 
overconstraining  by  using  all  available  data  when  updating  the  regularization  coefficient. 

2.4.3 Bayesian Analysis for Classification. Recently, Bayesian techniques have
been shown to be useful in terms of analyzing different aspects of neural network
architecture [7,10,38-41,46]. With Bayesian backpropagation, the regularization parameter $\alpha$ is
updated during the training process. The limitations of this technique lie in the
approximation of the error surface as a Gaussian function in the area local to the most probable
weight vector, $\mathbf{w}_{MP}$, which yields lowest error on the training set. Here, though, there is no
guarantee that the resultant weight vector provides the lowest generalization error. With
this Gaussian approximation, the Hessian needs to be computed and its eigenvalues found
in order to update $\alpha$ during training. A further approximation that avoids the evaluation of
the Hessian actually uses the current magnitude of the weight vector to update $\alpha$, thereby
simply loosening the restrictions on the current weight vector’s magnitude and carrying the
training further from the search for an optimal generalization ability. These techniques have
been used for (among other things) training the weights and choosing one network model over
another. Rather than finding a local acceptable weight vector which minimizes regression or
classification error, the Bayesian method (in its purest form) integrates over all weight space
when calculating the output of the ANN. When discussing Bayesian techniques,
marginalization becomes a topic of prime interest, since we need to integrate out the dependence of
our answer on the weights. For example, the output of an ANN, in the strictest Bayesian
sense, would be

\[
y^{(N+1)} = \int f(\mathbf{x}^{(N+1)}, \mathbf{w}) \, p(\mathbf{w} \mid D) \, d\mathbf{w}, \tag{2.16}
\]


where $f(\cdot)$ is the function represented by the network, $D$ is the training data set $D =
\{(\mathbf{x}^{(1)}, \mathbf{t}^{(1)}), \ldots, (\mathbf{x}^{(N)}, \mathbf{t}^{(N)})\}$, and $p(\mathbf{w} \mid D)$ is the probability density function. This integration
considers  the  outputs  resulting  from  all  possible  solutions  in  weight  space  weighted  by  the 
posterior  distribution  of  the  weights  at  those  points.  Therefore,  final  layer  outputs  resulting 
from  weights  that  lie  in  an  area  of  high  posterior  distribution  will  contribute  more  to  the 
integration  solution  than  outputs  resulting  from  weights  that  lie  in  an  area  of  low  posterior 
distribution. 

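According to Bayes’ Theorem [7], if $p(\cdot)$ is the pdf and $P(\cdot)$ is the cdf on $\mathbb{R}^W$, then

\[
p(\mathbf{w} \mid D) = \frac{P(D \mid \mathbf{w}) \, p(\mathbf{w})}{P(D)} \quad \text{for all } \mathbf{w} \in \mathbb{R}^W. \tag{2.17}
\]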

Notice  that  Bayes’  Theorem  can  be  interpreted  as  saying  that  the  posterior  distribution 
of  the  weights  is  equal  to  the  probability  of  a  data  set  being  correctly  classified  given  that 
set  of  weights  (the  likelihood),  weighted  by  the  ratio  of  the  value  of  the  weight  prior  density 
function at that point in weight space to the probability of the data set.

When using the softmax function so that the outputs approximate the posterior
probability of belonging to the correct class, the likelihood function is

\[
P(D \mid \mathbf{w}) = \prod_{n=1}^{N} \prod_{k=1}^{c} \left( y_k^{(n)} \right)^{t_k^{(n)}}. \tag{2.18}
\]

The likelihood function is the probability that all outputs are from the correct class
(since $y_k^{(n)}$ is the actual output at node $k$ and $t_k^{(n)}$ is the desired output at node $k$) and is a
multiplication over all outputs and all data vectors from all the different classes.

The prior probability density function (pdf), $p(\mathbf{w})$, is based on our belief in the form
the final weight vector should take. Usually, the bias/variance argument which affects the
generalization ability of the network suggests that the solution weight vector will be relatively
small [7], so $p(\mathbf{w})$ is usually chosen to be a Gaussian such that

\[
p(\mathbf{w}) = \left( \frac{\alpha}{2\pi} \right)^{W/2} \exp\left( -\frac{\alpha}{2} \|\mathbf{w}\|^2 \right). \tag{2.19}
\]

This prior pdf imposes the same type of restrictions on the magnitude of the weight
vector as does the regularization procedure discussed in Section 2.4.2. The regularization
factor, $\alpha$, is commonly referred to as a hyperparameter and is discussed in Section 2.4.3.1.
Integrating over all weight space can prove somewhat impractical, so approximations are
unavoidable.

2.4.3.1 Gaussian Approximation for Bayesian Training. MacKay [41], as
well as Buntine and Weigend [10], applied the Bayesian approach to ANN training for
practical applications. They make the assumption that the error surface in the vicinity of the
“most probable” weight vector (one with lowest error over the training set), $\mathbf{w}_{MP}$, is
locally a Gaussian function. This approximation allows the area in the vicinity of $\mathbf{w}_{MP}$ to
be evaluated as a quadratic error function, the analysis of which is straightforward but can
require the evaluation of the Hessian matrix. The final error function, $S(\mathbf{w})$, takes the form
of Equation (2.13).

Using the Gaussian approximation, the hyperparameter, $\alpha$, is initially chosen so as to
represent our confidence in our initial assumption about the tightness of the prior density,
$p(\mathbf{w})$, about zero ($1/\alpha$ is the variance of the prior’s Gaussian-shaped distribution). The net
is then trained using any standard technique which incorporates regularization, and $\alpha$ is
treated as the regularization coefficient and adjusted every few epochs during training.

The result of this technique (as summarized by Bishop [7] and applied solely to
classification) is demonstrated in Figure 2.9 and yields the following algorithm.

Bayesian Backpropagation.


1. Choose an initial positive value for the hyperparameter $\alpha$. Initialize the weights in the
network using values drawn from the prior distribution, $p(\mathbf{w})$.

2. Train the network using a standard non-linear optimization algorithm to minimize the
total error function $S(\mathbf{w})$ given in Equation (2.13).

3. Every few cycles of the algorithm, re-estimate values for $\alpha$. This can require evaluation
of the Hessian matrix and evaluation of its eigenvalue spectrum, or the use of one of
the approximations mentioned below.

4. Stop when the stopping criterion is met.

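The evaluation of the Hessian matrix can be avoided by using an approximation to
update values for $\alpha$. This is simply [7]

\[
\alpha_{new} = \frac{W}{2 E_W}. \tag{2.20}
\]

If we let $\rho$ be the magnitude of the weight vector, $\|\mathbf{w}\|$, and take $E_W$ here, as in [7], to be
half the sum of the squared weights without the factor $\alpha$, then

\[
E_W = \frac{1}{2} \sum_{i=1}^{W} w_i^2 = \frac{\rho^2}{2} \tag{2.21}
\]

and

\[
\alpha_{new} = \frac{W}{\rho^2}. \tag{2.22}
\]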

Therefore, it would appear that changing $\alpha$ using this approximation simply tracks a change
in $\rho$ and allows the network to train in the vicinity of the new radial complexity. The Gaussian
approximation  allows  Bayesian  backpropagation  to  train  the  ANN  using  regularization  with 
a  dynamic  regularization  coefficient,  but  some  would  argue  that  it  is  closer  to  the  intent  of 
Bayes’  rule  to  estimate  the  initial  integration  instead. 
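
The Hessian-free re-estimate of the hyperparameter reduces to a one-line computation. The sketch below assumes the update $\alpha_{new} = W / \rho^2$, with $W$ the number of weights and $\rho$ the magnitude of the weight vector; the function name `update_alpha` is hypothetical.

```python
def update_alpha(w):
    """Re-estimate alpha as (number of weights) / (squared weight magnitude)."""
    rho_squared = sum(wi * wi for wi in w)
    return len(w) / rho_squared
```

Doubling every weight divides the re-estimated coefficient by four, so under this approximation the penalty relaxes as the weight vector grows rather than holding it back.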


Figure 2.9 Flow chart for Bayesian training of the weights in an ANN.

2.4.3.2 Sampling Posterior Distribution in Weight Space. One way to sidestep
MacKay’s Gaussian approximation is to approximate the initial integration of Equation (2.16) [7,
46]. By finding a set of weight vectors where $p(\mathbf{w} \mid D)$ is relatively large in weight space, we
can approximate the integral in Equation (2.16) with

\[
y^{(N+1)} \approx \frac{1}{M} \sum_{m=1}^{M} f(\mathbf{x}^{(N+1)}, \mathbf{w}_m), \tag{2.23}
\]

where $M$ is the number of sample points in weight space. Neal [45,46] demonstrated the
applicability of this method by using a modified Monte-Carlo search to find different maximum
points in the posterior distribution of the weights.
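
Once a set of weight vectors has been sampled, the approximation amounts to averaging the network outputs over the samples. The sketch below assumes the samples are already available and that the network function `f(x, w)` returns a list of outputs; both the function name `predictive_output` and this calling convention are assumptions of the example, and the sampling procedure itself is not shown.

```python
def predictive_output(f, x, weight_samples):
    """Average the network outputs f(x, w) over M sampled weight vectors."""
    outputs = [f(x, w) for w in weight_samples]
    m = len(outputs)
    # zip(*outputs) groups the values for each output node across all samples.
    return [sum(node_values) / m for node_values in zip(*outputs)]
```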

While regularization limits the magnitude of the weight vector based on the regularization
coefficient, early stopping limits the magnitude of the weight vector by stopping the
training  (and  therefore  the  growth  of  the  magnitude  of  the  weight  vector)  based  on  the  error 
obtained  on  the  validation  set  which  is  not  used  to  update  the  weights  [7]. 

2.4.4 Cross-Validation. Cross-validation estimates the generalization error by
using the cross-validation set error as an approximation to the true generalization error.
Cross-validational early stopping is a popular tool for training ANNs since it increases the
generalization ability of the network after training [59]. In cross-validation, the training data
set is broken into multiple sets, new training sets and validation sets. The weights are
updated using any standard MLE approach, but now the error on the validation set is tracked
along with the error on the training set. When cross-validational early stopping is used,
the weight vector chosen is the one that minimizes the error on the cross-validation set. As
Bishop points out [7], cross-validational early stopping limits the effective complexity of the
ANN since the ANN is trained, not until a set error is achieved on the training set, but until
the error on the validation set begins to increase. Since standard backpropagation techniques
usually attempt to set the magnitude of the weight vector to a small value at initialization
and the weights grow during training (see Section 3.2), the cross-validational stopping
criterion is correlated with the effective complexity of the ANN rather than a training error
minimization.

Figure 2.10 shows an example of cross-validation. The two curves represent errors
obtained on two data sets, training data and validation data. The training data always has
the lower error since it is a biased estimate of the error that will be observed on the test
data, while the validation error diverges from the training error and eventually begins to
increase with increasing complexity.

Figure 2.10 Cross-validation training on the TESSA data set. Here, the error is plotted
versus the training epoch.
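
Cross-validational early stopping can be sketched as a training loop that remembers the weights with the lowest validation error. The names `train_with_early_stopping`, `step`, and `val_error` are hypothetical, and the `patience` rule (stop after a fixed number of epochs without improvement) is one common way to decide that the validation error has begun to increase, not a rule taken from the text.

```python
def train_with_early_stopping(step, val_error, w0, max_epochs=1000, patience=10):
    """Keep the weights with the lowest validation error seen during training."""
    w = w0
    best_w, best_err = w0, val_error(w0)
    epochs_since_best = 0
    for _ in range(max_epochs):
        w = step(w)              # one epoch of weight updates on the training set
        err = val_error(w)       # error on data not used to update the weights
        if err < best_err:
            best_w, best_err, epochs_since_best = w, err, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break            # validation error has stopped improving
    return best_w, best_err
```

Because the weights typically grow during training, the weight vector returned here has a smaller magnitude than the one that training to a minimum of the training error would reach.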

2.4.5 Magnitude of the Weight Vector. The impetus behind regularization is the
attempt  to  limit  the  effective  complexity  of  the  ANN  by  limiting  the  magnitude  of  the  weight 
vector.  Saseetharran  points  out  that  small  initial  weights  prevent  saturation  of  the  sigmoid 
activation  functions,  but  quickly  grow  into  the  saturation  regions  [64].  Ruck  indicates  that 
a  limitation  in  the  Bayes  optimal  classifier  approximation  would  occur  if  the  structural 
complexity  of  the  network  (number  of  hidden  nodes)  was  too  low  [61].  This  implies  that 
we  need  some  minimum  complexity  to  approximate  a  Bayes  optimal  classifier,  below  which 
the  approximation  breaks  down.  When  limiting  the  structural  complexity  of  the  network, 
a popular tool that has been used for bounding the generalization error is the
“Vapnik-Chervonenkis,” or “V-C,” dimension [77]. In the case of generalization of ANNs, we want
to consider the generalization error of a given architecture for a data set of size $N$, denoted
$g_N(y)$, as compared to the average generalization ability, $g(y)$. In that vein, we write

\[
P\left( \max_{y} |g_N(y) - g(y)| > \epsilon \right) \le 4 \Delta(2N) \, e^{-\epsilon^2 N / 8}. \tag{2.24}
\]

Equation (2.24) states that the probability of the max difference in generalization errors being
greater than $\epsilon$ is bounded by some function of the number of training samples, $N$, and $\epsilon$.
The growth function, $\Delta(N)$, gives the number of dichotomies which can be implemented by
the ANN on a set of $N$ training samples. Vapnik and Chervonenkis showed that this growth
function is either identically equal to $2^N$ for all $N$, or is bounded above by the relation

\[
\Delta(N) \le N^{d_{VC}} + 1, \tag{2.25}
\]

where $d_{VC}$ is the V-C dimension, and $N_{d_{VC}}$ is the number of patterns that a given network
architecture can memorize. Once the number of training samples becomes greater than
$N_{d_{VC}}$, $\Delta(N)$ begins to slow down compared with the exponential term in Equation (2.24)
and we can see that the right hand side of Equation (2.24) becomes arbitrarily small by
making  N  sufficiently  large.  The  primary  downside  to  the  V-C  dimension  analysis  is  that 
it  yields  an  extremely  conservative  estimate  of  the  number  of  training  data  points  necessary 
to  train  an  ANN  to  achieve  good  generalization  results  [4,7]. 
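
How conservative the bound is can be seen by evaluating its right hand side numerically. The sketch below assumes the bound has the form $4\Delta(2N)e^{-\epsilon^2 N/8}$ and replaces the growth function by the polynomial bound $(2N)^{d_{VC}} + 1$; the function name `vc_bound` and the example values are illustrative. A probability bound above 1 carries no information, so the result is capped at 1.

```python
import math

def vc_bound(n, d_vc, eps):
    """Right hand side of the V-C bound with the polynomial growth-function bound."""
    growth = (2 * n) ** d_vc + 1          # exact integer, avoids float overflow
    log_bound = math.log(4 * growth) - (eps ** 2) * n / 8.0
    if log_bound >= 0.0:
        return 1.0                        # the bound is vacuous
    return math.exp(log_bound)
```

With `d_vc = 5` and `eps = 0.1`, the bound is still vacuous at `n = 10` and only becomes small once `n` reaches the order of 10^5, which illustrates why the estimate is considered extremely conservative.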

Bartlett  [4]  showed  that  for  ANNs  used  for  classification,  the  size  of  the  individual 
weights  is  more  important  for  generalization  than  is  the  number  of  weights.  He  indicates  that, 
even  with  a  number  of  weights  much  larger  than  is  called  for  based  on  the  V-C  dimension,  if 
the  effective  complexity  is  limited,  the  generalization  ability  is  not  compromised,  and,  in  fact, 
larger  networks  can  generalize  better  because  they  can  create  a  larger  number  of  discriminant 
functions.  The  effective  complexity  based  on  the  magnitude  of  the  weights  rather  than  the 
number of weights is what regularization and early stopping attempt to minimize.

2.5 Summary

In  this  chapter,  we  reviewed  how  it  has  been  shown  that  good  generalization  is  the  key 
determinant  when  deciding  on  the  suitability  of  an  ANN  for  pattern  recognition  tasks.  In  the 
context  of  limiting  the  effective  complexity  of  an  ANN,  the  methods  of  regularization  and 
cross-validational  early  stopping  were  examined  and  showed  that  limiting  the  magnitudes 
of  the  weights  during  training  tends  to  yield  good  generalization  results.  The  next  chapter 
examines  the  characteristics  of  the  radial  complexity,  including  a  method  of  training  an  ANN 
to  an  estimated  radial  complexity  that  will  have  improved  generalization  characteristics  after 
training. 
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III.  Achieving  Good  Generalization 

The  goal  when  training  an  ANN  is  to  consistently  achieve  the  lowest  error  on  future  data 
sets.  Training  the  ANN  to  achieve  low  error  on  the  training  set  does  not  guarantee  low  error 
on  future  data.  In  fact,  training  to  low  error  on  the  training  set  can  lead  to  overtraining 
and  actually  introduce  higher  error  on  the  subsequent  test  set  data.  Thus,  regularization 
pushes  the  weight  vector  in  the  direction  of  lower  magnitude,  thereby  limiting  the  ability 
of  the  ANN  to  form  overly-complex  decision  boundaries  and  overtrain,  and  early  stopping 
stops  the  training  process  before  overtraining  can  occur  by  also  limiting  the  complexity  of 
the  decision  boundaries. 

When  standard  training  begins  in  an  ANN,  each  weight  is  usually  initialized  to  some 
“small”  random  value  so  as  to  allow  “growing  room,”  [12]  while  in  Bayesian  backpropagation, 
the  weights  are  initialized  based  on  our  belief  in  what  the  final  form  of  the  weight  vector  will 
be.  Since  the  magnitude  of  the  weight  vector  is  directly  related  to  the  ability  of  the  ANN  to 
form  complex  decision  boundaries,  the  initial  weight  vector  should  be  based  directly  on  the 
desired  initial  radial  complexity. 

3.1  Initial  Radial  Complexity 

As  discussed  in  Section  2.2,  the  initial  value  of  the  weights  plays  an  important  and  often 
overlooked  part  in  the  training  process.  One  aspect  of  the  Bayesian  discussion  of  a  prior 
pdf  simply  formalizes  what  has  always  been  done  when  training  ANNs;  namely  initializing 
the  weight  vector,  w,  based  on  a  prior  belief  about  the  final  form  of  the  weights.  Given 
the  importance  of  the  radial  complexity  to  the  generalization  ability  of  the  final  network, 
the  expected  initial  radial  complexity  generated  by  our  initialization  of  the  weights  must  be 
considered before training is begun. The expected value of the radius, $\mathcal{E}(\rho)$, is

$\mathcal{E}(\rho) = \int_0^\infty \rho\, f_\rho(\rho)\, d\rho,$  (3.1)

but this requires knowledge about the probability density function, $f_\rho(\rho)$. Here $\mathcal{E}(\cdot)$ is the
expectation operator with respect to the pdf $p(\cdot)$.

3.1.1 Weights Distributed Normally. To find $f_\rho(\rho)$, we start with each individual
weight distribution. The first case considered is when each $w_i$ has a distribution that is
$N(0, \sigma^2)$ (the prior pdf we have been discussing for Bayesian training). Define variables $x_i$
such that $x_i$ is $N(0, 1)$; then $w_i = \sigma x_i$ is $N(0, \sigma^2)$. Furthermore, define a variable $y_i$ such that

$y_i = w_i^2 = (\sigma x_i)^2 = \sigma^2 x_i^2.$  (3.2)

We know that

$\rho^2 = y_1 + y_2 + \ldots + y_W = \sigma^2 x_1^2 + \sigma^2 x_2^2 + \ldots + \sigma^2 x_W^2,$  (3.3)

therefore

$\frac{\rho^2}{\sigma^2} = x_1^2 + x_2^2 + \ldots + x_W^2.$  (3.4)

Define $z$ such that

$z = \frac{\rho^2}{\sigma^2},$  (3.5)

then we know that $z$ has a Chi-square distribution, $f_Z(z)$, for the random variable $Z$, with
$W$ degrees of freedom [20], so

$f_Z(z) = \begin{cases} \dfrac{1}{\Gamma\left(\frac{W}{2}\right) 2^{W/2}}\, z^{W/2-1} \exp\left(-\dfrac{z}{2}\right), & z > 0 \\ 0, & z \le 0 \end{cases}$  (3.6)

What we want, though, is the expected value of $\rho$. This is found by rearranging
Equation (3.5) to read

$\rho = \sigma\sqrt{z}.$  (3.7)

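We can now find the expected value of $\rho$ using

$\mathcal{E}(\rho) = \int_0^\infty \rho\, f_\rho(\rho)\, d\rho$
$\quad = \mathcal{E}(\sigma\sqrt{z})$
$\quad = \int_{-\infty}^{\infty} \sigma\sqrt{z}\, f_Z(z)\, dz$
$\quad = \int_0^\infty \sigma\sqrt{z}\, \dfrac{1}{\Gamma\left(\frac{W}{2}\right) 2^{W/2}}\, z^{W/2-1} \exp\left(-\dfrac{z}{2}\right) dz$  (3.8)
$\quad = \dfrac{\sigma}{\Gamma\left(\frac{W}{2}\right) 2^{W/2}} \int_0^\infty z^{W/2-1/2} \exp\left(-\dfrac{z}{2}\right) dz.$  (3.9)

Setting $m = \frac{W}{2} - \frac{1}{2}$ and $a = \frac{1}{2}$, and consulting the CRC Standard Mathematical Tables [6],
we see that

$\int_0^\infty x^m \exp(-ax)\, dx = \dfrac{\Gamma(m+1)}{a^{m+1}}.$  (3.10)

Hence,

$\mathcal{E}(\rho) = \sigma\sqrt{2}\, \dfrac{\Gamma\left(\frac{W+1}{2}\right)}{\Gamma\left(\frac{W}{2}\right)}.$  (3.11)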

Equation (3.11) is the exact expression for the expected value of the radius of a weight
vector of length $W$ when each element, $w_i$, is drawn from a prior distribution which is
$N(0, \sigma^2)$. This expression becomes unwieldy for even moderately large weight vectors, though,
because the numerator and denominator are both Gamma functions of $W$ that grow very
rapidly. We can therefore seek an approximation for
Equation (3.11) by using Stirling's approximation for the Gamma functions in the numerator
and denominator [6]. The first term in Stirling's approximation states, for $x > 10$,

$\Gamma(x) \approx x^x \exp(-x) \sqrt{\dfrac{2\pi}{x}} = x^{x-1/2} \exp(-x) \sqrt{2\pi},$

which is reasonable since we are dealing with large $W$ ($W > 20$). Using this approximation,
then, yields

$\dfrac{\Gamma\left(\frac{W+1}{2}\right)}{\Gamma\left(\frac{W}{2}\right)} \approx \dfrac{\left(\frac{W+1}{2}\right)^{W/2} \exp\left(-\frac{W+1}{2}\right) \sqrt{2\pi}}{\left(\frac{W}{2}\right)^{W/2-1/2} \exp\left(-\frac{W}{2}\right) \sqrt{2\pi}} = \left[\left(1 + \dfrac{1}{W}\right)^{W}\right]^{1/2} \left(\dfrac{W}{2}\right)^{1/2} \exp\left(-\dfrac{1}{2}\right).$  (3.12)

Observe that a property of Euler's number $e$ [6] is (since $W$ is large)

$\lim_{W \to \infty} \left(1 + \dfrac{1}{W}\right)^{W} = e.$  (3.13)

Substituting property (3.13) into Equation (3.12) gives, for large $W$,

$\dfrac{\Gamma\left(\frac{W+1}{2}\right)}{\Gamma\left(\frac{W}{2}\right)} \approx \sqrt{\dfrac{W}{2}}.$  (3.14)

Now, putting Equation (3.14) into Equation (3.11) gives an approximation for the expected
value of $\rho$ when $W$ is large

$\mathcal{E}(\rho) \approx \sigma\sqrt{W}.$  (3.15)
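The exact and approximate expressions for the expected radius can be compared numerically. The following is a minimal sketch, assuming the exact form $\sigma\sqrt{2}\,\Gamma((W+1)/2)/\Gamma(W/2)$ and the large-$W$ approximation $\sigma\sqrt{W}$; the function names are illustrative and not from the original work. Working with the logarithm of the Gamma function avoids the overflow that makes the exact ratio awkward for large $W$.

```python
import math
import random


def expected_radius_exact(num_weights: int, sigma: float) -> float:
    """Exact mean radius: sigma * sqrt(2) * Gamma((W+1)/2) / Gamma(W/2)."""
    # lgamma keeps the Gamma ratio finite even when Gamma(W/2) itself overflows.
    log_ratio = math.lgamma((num_weights + 1) / 2) - math.lgamma(num_weights / 2)
    return sigma * math.sqrt(2.0) * math.exp(log_ratio)


def expected_radius_approx(num_weights: int, sigma: float) -> float:
    """Large-W approximation: sigma * sqrt(W)."""
    return sigma * math.sqrt(num_weights)


def expected_radius_sampled(num_weights, sigma, trials=2000, seed=0):
    """Monte Carlo estimate: average norm of weight vectors drawn from N(0, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(num_weights)))
    return total / trials


if __name__ == "__main__":
    for w in (1, 10, 100, 1000):
        print(w, expected_radius_exact(w, 0.5), expected_radius_approx(w, 0.5))
```

Under these assumptions the two expressions differ noticeably for a handful of weights and agree to within about one percent once $W$ reaches a few dozen.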

Summarizing, if a weight vector, $\mathbf{w}$, of size $W$ has elements drawn from a normal
distribution such that each $w_i$ is $N(0, \sigma^2)$, the expected value of the magnitude of that weight
vector, if $W$ is large, can be approximated using Equation (3.15).

Notice that if we look at the relationship between $\mathcal{E}(\rho)$ and $\mathcal{E}(\rho^2)$, we find the variance
of $\rho$ to be

$\mathrm{var}(\rho) = \mathcal{E}(\rho^2) - [\mathcal{E}(\rho)]^2.$  (3.16)

Re-examining a previous variable, $z$, we see once again that

$z = \dfrac{\rho^2}{\sigma^2}.$  (3.17)

This allows us to find $\mathcal{E}(\rho^2)$ since

$\rho^2 = \sigma^2 z,$

therefore,

$\mathcal{E}(\rho^2) = \sigma^2 \mathcal{E}(z).$  (3.18)

This is a simple calculation since $Z$ is a Chi-square random variable with $W$ degrees of
freedom, and the expected value for a Chi-square random variable is simply the number of
degrees of freedom. This leads us to the conclusion

$\mathcal{E}(\rho^2) = \sigma^2 W.$  (3.19)

Notice that in this case (remember that when $W$ is large, $\mathcal{E}(\rho) \approx \sigma\sqrt{W}$)

$\mathcal{E}(\rho^2) \approx [\mathcal{E}(\rho)]^2.$  (3.20)

This implies that when $W$ is large, the variance of $\rho$ is negligible compared with $\mathcal{E}(\rho^2)$.
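The size of the variance can be checked directly from $\mathrm{var}(\rho) = \mathcal{E}(\rho^2) - [\mathcal{E}(\rho)]^2$. The sketch below assumes normally distributed weights, the exact mean $\sigma\sqrt{2}\,\Gamma((W+1)/2)/\Gamma(W/2)$, and $\mathcal{E}(\rho^2) = \sigma^2 W$; the function name is illustrative. In this sketch the variance settles near $\sigma^2/2$ rather than vanishing, so it is the ratio of the variance to $\mathcal{E}(\rho^2)$ that shrinks as $W$ grows.

```python
import math


def radius_moments_normal(num_weights: int, sigma: float):
    """Return (mean, second moment, variance) of the radius for N(0, sigma^2) weights."""
    log_ratio = math.lgamma((num_weights + 1) / 2) - math.lgamma(num_weights / 2)
    mean = sigma * math.sqrt(2.0) * math.exp(log_ratio)
    second_moment = sigma ** 2 * num_weights  # sigma^2 times the chi-square mean W
    variance = second_moment - mean ** 2
    return mean, second_moment, variance


if __name__ == "__main__":
    for w in (1, 10, 100, 1000):
        mean, second, var = radius_moments_normal(w, 1.0)
        # The last column is the variance relative to the second moment.
        print(w, round(mean, 4), second, round(var, 4), round(var / second, 6))
```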

3.1.2 Weights Distributed Uniformly. Next we concentrate on the problem of
finding the expected value of the radius when each individual weight is drawn from a uniform
distribution. In ANNs, it is common to initialize the weight vector with small values drawn
uniformly between -0.5 and +0.5 [12]. Here, we generalize and say the weights are drawn from
a distribution that is uniform over the region $-a$ to $a$ ($a > 0$). This leads to a probability
density function of the form

$f_W(w) = \begin{cases} \dfrac{1}{2a} & \text{for } -a \le w \le a \\ 0 & \text{otherwise} \end{cases}$  (3.21)

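We wish to find the expected value of the radius, $\rho = \sqrt{w_1^2 + w_2^2 + \ldots + w_W^2}$. Define a random
variable $X$ having a pdf

$f_X(x) = \begin{cases} \dfrac{1}{2} & \text{for } -1 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$  (3.22)

We can now state that $w_i = a x_i$ is a uniform random variable from $-a$ to $a$. This leads to

$\rho = \sqrt{w_1^2 + w_2^2 + \ldots + w_W^2}$
$\quad = \sqrt{(a x_1)^2 + (a x_2)^2 + \ldots + (a x_W)^2}$
$\quad = a\sqrt{x_1^2 + x_2^2 + \ldots + x_W^2},$

which yields

$\mathcal{E}(\rho) = \mathcal{E}\left(a\sqrt{x_1^2 + x_2^2 + \ldots + x_W^2}\right) = a\,\mathcal{E}\left(\sqrt{x_1^2 + x_2^2 + \ldots + x_W^2}\right).$

We know that

$\mathcal{E}\left(\sqrt{x_1^2 + x_2^2 + \ldots + x_W^2}\right) = \int \int \cdots \int \sqrt{x_1^2 + x_2^2 + \ldots + x_W^2}\; p_{X_1}(x_1)\, p_{X_2}(x_2) \cdots p_{X_W}(x_W)\, dx_1\, dx_2 \ldots dx_W,$

which leads to

$\mathcal{E}\left(\sqrt{x_1^2 + x_2^2 + \ldots + x_W^2}\right) = \left(\dfrac{1}{2}\right)^{W} \int_{-1}^{1} \int_{-1}^{1} \cdots \int_{-1}^{1} \sqrt{x_1^2 + x_2^2 + \ldots + x_W^2}\; dx_1\, dx_2 \ldots dx_W.$  (3.23)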


The integration in Equation (3.23) does not have a closed form solution so we cannot give
an expression for the expected value of $\rho$ as a function of $W$; but we know the variance of
our random variable, $\rho$, is

$\mathrm{var}(\rho) = \mathcal{E}(\rho^2) - [\mathcal{E}(\rho)]^2.$  (3.24)

This leads to

$[\mathcal{E}(\rho)]^2 = \mathcal{E}(\rho^2) - \mathrm{var}(\rho).$  (3.25)

Remembering the results of our approximation when the weights were initially drawn from a
normal distribution, namely that $\mathrm{var}(\rho)$ was negligible for large $W$, then if we can establish
that this is the case when the weights have a uniform distribution as well, we can estimate
$\mathcal{E}(\rho)$ using

$\mathcal{E}(\rho) \approx \sqrt{\mathcal{E}(\rho^2)}.$  (3.26)

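We first need to calculate $\mathcal{E}(\rho^2)$. We define a random variable, $y$, such that

$y = w^2.$  (3.27)

Using the Square Law from Thomas's book [70], we find that the pdf of $y$ is then

$f_Y(y) = \begin{cases} \dfrac{f_W(\sqrt{y}) + f_W(-\sqrt{y})}{2\sqrt{y}} & \text{for } 0 \le \sqrt{y} \le a \\ 0 & \text{otherwise} \end{cases}$  (3.28)

First,

$f_W(-\sqrt{y}) = \dfrac{1}{2a} \quad \text{for } -a \le -\sqrt{y} \le 0,$  (3.29)

and

$f_W(\sqrt{y}) = \dfrac{1}{2a} \quad \text{for } 0 \le \sqrt{y} \le a.$  (3.30)

But multiplying the limit terms on the negative radical by $-1$ yields

$f_W(-\sqrt{y}) = \dfrac{1}{2a} \quad \text{for } a \ge \sqrt{y} \ge 0,$
$f_W(\sqrt{y}) = \dfrac{1}{2a} \quad \text{for } 0 \le \sqrt{y} \le a.$

Now squaring the limits gives

$f_W(-\sqrt{y}) = \dfrac{1}{2a} \quad \text{for } 0 \le y \le a^2,$
$f_W(\sqrt{y}) = \dfrac{1}{2a} \quad \text{for } 0 \le y \le a^2.$

Finally, we have then

$f_Y(y) = \begin{cases} \dfrac{\frac{1}{2a} + \frac{1}{2a}}{2\sqrt{y}} & \text{for } 0 \le y \le a^2 \\ 0 & \text{otherwise} \end{cases} = \begin{cases} \dfrac{1}{2a\sqrt{y}} & \text{for } 0 \le y \le a^2 \\ 0 & \text{otherwise} \end{cases}$  (3.31)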

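We will need the expected value of $y$,

$\mathcal{E}(y) = \int_{-\infty}^{\infty} y\, f_Y(y)\, dy$
$\quad = \int_0^{a^2} y\, \dfrac{1}{2a\sqrt{y}}\, dy$
$\quad = \dfrac{1}{2a} \int_0^{a^2} \sqrt{y}\, dy$
$\quad = \dfrac{a^2}{3}.$  (3.32)

Now, if we define $z$ again to be

$z = y_1 + y_2 + \ldots + y_W,$  (3.33)

then we know

$\mathcal{E}(z) = \mathcal{E}(y_1 + y_2 + \ldots + y_W) = W\,\mathcal{E}(y_n).$

So then, with

$\rho^2 = z,$  (3.34)

we have

$\mathcal{E}(\rho^2) = \dfrac{W a^2}{3}.$  (3.35)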

Now, remember that we are trying to establish that $\mathrm{var}(\rho)$ becomes negligible as $W$
increases. Although $\mathcal{E}(\rho) = a\,\mathcal{E}\left(\sqrt{x_1^2 + \ldots + x_W^2}\right)$ cannot be found analytically, we can
use numerical integration to establish its behavior as $W$ increases, thereby establishing the
behavior of $\sigma_\rho$ as $W$ increases, where $\sigma_\rho^2 = \mathrm{var}(\rho)$. For this exercise, it will be convenient
to re-express Equation (3.24) in the form

$\dfrac{\sigma_\rho^2}{a^2} = \dfrac{\mathcal{E}(\rho^2)}{a^2} - \dfrac{[\mathcal{E}(\rho)]^2}{a^2}.$

Approximating the integration in the above expression numerically for values of $W$ from 1 to
10, we can glean the behavior of the variance of the complexity as $W$ increases. Figure 3.1
shows how the variance of the radial complexity decreases with increasing $W$ and Figure 3.2
demonstrates how the expected value of the square of the complexity grows along with the
square of the expected value of the complexity with increasing $W$. Figure 3.3 shows the
difference between the calculated approximation of $\mathcal{E}(\rho)$ and the average magnitude of $\rho$
observed when initializing a set of weights using a uniform distribution; this analysis
supports our use of the approximation in Equation (3.36).

From this analysis, it is reasonable to use the approximation of Equation (3.26) and
estimate the expected value of the complexity as

$\mathcal{E}(\rho) \approx a\sqrt{\dfrac{W}{3}}.$  (3.36)

Figure 3.1 Demonstration of the decay of the variance of the radial complexity, $(\sigma_\rho / a)^2$, with
increasing W.
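The approximation for uniformly distributed weights can also be checked by sampling instead of numerical integration. This is a minimal sketch, assuming weights drawn uniformly from $-a$ to $a$ and the approximation $a\sqrt{W/3}$; Monte Carlo sampling is swapped in here for the numerical integration described above, and all names are illustrative.

```python
import math
import random


def sample_uniform_radii(num_weights, a, trials=2000, seed=0):
    """Draw weight vectors uniformly from [-a, a] and return their magnitudes."""
    rng = random.Random(seed)
    return [
        math.sqrt(sum(rng.uniform(-a, a) ** 2 for _ in range(num_weights)))
        for _ in range(trials)
    ]


def mean_and_variance(values):
    """Sample mean and (population) variance of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def approx_expected_radius(num_weights, a):
    """Approximation a * sqrt(W / 3), obtained from sqrt(E(rho^2))."""
    return a * math.sqrt(num_weights / 3.0)


if __name__ == "__main__":
    a = 0.5
    for w in (1, 2, 5, 10, 100):
        mean, var = mean_and_variance(sample_uniform_radii(w, a))
        print(w, round(mean, 4), round(approx_expected_radius(w, a), 4), round(var / a ** 2, 4))
```

In this sketch the approximation is visibly off for a single weight and tightens as $W$ grows, which is the qualitative behavior described for Figure 3.3.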

With  the  determination  of  the  expected  initial  radial  complexity,  what  is  the  behavior  of 
that  radial  complexity  during  different  types  of  weight  training? 

3.2  Consistent  Behavior  of  Radial  Complexity  During  Training 

One method of determining when the weights in the ANN are “good enough” is based
on the training error; once the training set error is at or below some specified value, one
can discontinue training and test the ANN’s ability to classify a test data set [7,59]. Having
shown that the distribution used to initialize the weights determines the expected value of
the initial radial complexity of the weight vector, we now wish to establish the relationship
between the radial complexity and the training error during training of the ANN, since this
radial complexity contributes greatly to the ability of the ANN to generalize well to new
data. In this section, we demonstrate one aspect of the behavior exhibited by the radial
complexity during training of an ANN using standard backpropagation and regularization
with sum-squared and softmax error on the OCR and TESSA data sets. Using an ANN with
70 hidden nodes and 10 output nodes (one for each class), the demonstrations in this section
show an almost deterministic behavior of the radial complexity as the training occurs. In
Figures 3.4, 3.5, 3.6 and 3.7, looking at the plots in each figure going from left to right,
the weights are re-initialized using values drawn from a prior Gaussian distribution which is
set for low radial complexity ($\mathcal{E}(\rho) = 0.1$) for standard backpropagation or high complexity
($\mathcal{E}(\rho) = 95$) for Bayesian backpropagation. Each figure consists of 5 different runs from left
to right. The only difference between the plots is that the weights are re-initialized, using
the same distribution each time. The complexity measure is that of radial complexity, or
the magnitude of the weight vector. Figure 3.4 shows the consistency of radius growth when
using SSE and standard backprop on the OCR data set, Figure 3.5 shows the consistency
of radius decay when using SSE and Bayesian backprop on the OCR data set, Figure 3.6
shows the consistency of radius growth when using softmax error and vanilla backprop on
the OCR data set, and Figure 3.7 shows the consistency of radius decay when using softmax
error and Bayesian backprop on the OCR data set. Notice that as the training ensues, even
though the weights have been randomly reinitialized each time, the way the error and radial
complexity change is consistent; that is to say that the error and radial complexity change
in the same manner with each subsequent run.

Figure 3.2 Demonstration of the growth of the expected value of the radial complexity
squared, $\mathcal{E}(\rho^2)$, and the square of the expected value of the radial complexity, $[\mathcal{E}(\rho)]^2$, with
increasing W.

Figure 3.3 Demonstration of the growth of the approximate expected value of the radial
complexity (dotted) and the average observed magnitude of the radial complexity
(solid) with increasing W. The plot on the top shows that, for small
W, the approximation is not very good; but the plot on the bottom shows that
for increasingly large W, the approximation is much better.
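Tracking the radial complexity alongside the training error only requires recording the norm of the weight vector once per epoch. The sketch below is a deliberately small stand-in, not the 70-hidden-node network or the OCR data used in the experiments: it assumes a single logistic unit trained by batch gradient descent on a toy two-class problem, and every name in it is hypothetical. Re-running it from different small initial weights gives radius curves that stay close together, which is the kind of consistency the figures illustrate.

```python
import math


def radial_complexity(weights):
    """Magnitude of the weight vector."""
    return math.sqrt(sum(w * w for w in weights))


def train_and_track(samples, targets, initial_weights, learning_rate=0.5, epochs=50):
    """Train one logistic unit; return per-epoch (errors, radii) recorded before each update."""
    weights = list(initial_weights)  # [w1, w2, bias]
    errors, radii = [], []
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        loss = 0.0
        for (x1, x2), t in zip(samples, targets):
            y = 1.0 / (1.0 + math.exp(-(weights[0] * x1 + weights[1] * x2 + weights[2])))
            loss -= t * math.log(y) + (1.0 - t) * math.log(1.0 - y)
            delta = y - t  # cross-entropy gradient with respect to the pre-activation
            grad[0] += delta * x1
            grad[1] += delta * x2
            grad[2] += delta
        errors.append(loss / len(samples))
        radii.append(radial_complexity(weights))
        weights = [w - learning_rate * g / len(samples) for w, g in zip(weights, grad)]
    return errors, radii


if __name__ == "__main__":
    toy_samples = [(-1.0, -1.0), (-1.0, -0.5), (1.0, 1.0), (1.0, 0.5)]
    toy_targets = [0.0, 0.0, 1.0, 1.0]
    for start in ([0.01, -0.01, 0.01], [-0.02, 0.01, 0.0]):
        errors, radii = train_and_track(toy_samples, toy_targets, start)
        print(round(errors[-1], 4), round(radii[-1], 4))
```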

This consistent behavior leads us to the conclusion that, when basing the stopping criteria
on the training set error, the final radial complexity of the ANN is the same to within
the variance of the initial magnitude of the weight vector; but is the training method
responsible for the consistent behavior of the radial complexity? The next section demonstrates
the results of using a non-backpropagation training method for the weights.

Figure 3.4 Each plot along the top row tracks the training set error versus the training
epoch while each plot along the bottom row demonstrates how the magnitude
of the weight vector changes as training ensues. Each column is an independent
training run with the only change between columns being the re-initialization
of the weight vector.

Figure 3.5 Each plot along the top row tracks the training set error versus the training
epoch while each plot along the bottom row demonstrates how the magnitude
of the weight vector changes as training ensues. Each column is an independent
training run with the only change between columns being the re-initialization
of the weight vector.

Figure 3.6 Each plot along the top row tracks the training set error versus the training
epoch while each plot along the bottom row demonstrates how the magnitude
of the weight vector changes as training ensues. Each column is an independent
training run with the only change between columns being the re-initialization
of the weight vector.

Figure 3.7 Each plot along the top row tracks the training set error versus the training
epoch while each plot along the bottom row demonstrates how the magnitude
of the weight vector changes as training ensues. Each column is an independent
training run with the only change between columns being the re-initialization
of the weight vector.


3.2.1 Consistency of Training Behavior When Using EP Training. The consistency
demonstrated above is not a result of using backpropagation. Backpropagation is not the
only method for setting the weights in an ANN. In fact, the final form the weight vector
takes should be independent of the method used to obtain that vector. Evolution programs
provide an alternative method for ANN weight training. The EP technique discussed in
Section 2.3.1.3 was used to analyze the behavior of the radius as shown in Figure 3.8. This
figure demonstrates how the radial complexity behaves when using an evolution program.
The squashing function is incorporated into the EP to preclude getting the same solution.
The radial complexity also displays consistent behavior when being trained using the EP.
Clearly, the behavior of the radial complexity as the ANN training seeks out a weight vector
yielding a minimum error is not dependent on the backpropagation of error routine.

Figure 3.8 Each plot along the top row tracks the training set error versus the training
epoch while each plot along the bottom row demonstrates how the magnitude
of the weight vector changes as training ensues. Each column is an independent
training run with the only change between columns being the re-initialization
of the weight vector.

The above analysis was carried out on a hand-written character set. Does this data set
give rise to an error surface that would cause the radial complexity to exhibit this consistent
behavior? The next section shows that different data also gives rise to this type of behavior.

3.2.2 Consistency of Training Behavior When Training on TESSA Data Set.
Would the radius still exhibit the same consistent behavior on a different data set than
the character data set in the previous section? The TESSA data set contains infrared
images that are very hard to classify. Training on this data set reveals that the radial
complexity behaves consistently during training even when the network is not converging
to a solution (Figure 3.9). This figure demonstrates the consistent behavior of the radius
when training on the infrared target data set. Note the consistency with which the radius
behaves even when the error is not converging to a solution. Once again, we see that the
behavior of the radial complexity remains consistent between runs even when the weights
are re-initialized to different random values each time.

Figure 3.9 Each plot along the top row tracks the training set error versus the training
epoch while each plot along the bottom row demonstrates how the magnitude
of the weight vector changes as training ensues. Each column is an independent
training run with the only change between columns being the re-initialization
of the weight vector.

This analysis of the behavior of the radial complexity during training also gives us
better insight as to how cross-validational early stopping limits the magnitude of the weight
vector; the training (and therefore the growth of the radial complexity) is stopped earlier
than if training set error were the stopping criterion.

The  consistency  of  behavior  of  the  radial  complexity  is  independent  of  the  training 
method  and  the  data  set.  Having  shown  in  this  section  that  re-initializing  an  ANN  with 
a  different  set  of  weights  drawn  from  the  same  distribution  does  not  guarantee  a  different 
radial  complexity  when  training  is  complete  (in  fact,  training  to  some  set  training  error 
will  most  likely  put  us  into  the  same  radial  complexity  every  time),  and  that  the  radial 
complexity  consistently  changes  during  training,  we  need  a  way  to  estimate  at  what  radial 
complexity  to  halt  the  training  so  that  we  can  achieve  good  generalization  after  the  training 
is  complete.  The  next  section  presents  the  steps  taken  to  obtain  a  good  generalization  radial 
complexity  and  then  train  to  that  complexity.  Cross-validation  plays  a  key  role  in  empirically 
establishing  the  relationship  between  generalization  error  and  radial  complexity. 

3.3 Cross-Validational Radial Complexity Estimation

When using the full data set as the training set, we cannot use cross-validatory early
stopping since we now have no data for a cross-validation set. We can, though, estimate the
expected radial complexity for the full data set that will provide enough constraint on the
discriminant boundaries such that a Bayes optimal discriminant function is best approximated.
White [83] describes how the structural complexity of an ANN grows as the number
of training data points, or experience, grows. Bartlett [4] augments this analysis with his
proof that the magnitude of the weights (quantified here as the radial complexity) can be
more important to generalization than the number of weights. Therefore, in classification
problems for which ANNs are well suited, the effective complexity should also grow in the
same manner as the structural complexity when the training data set grows [7,59]. Estimating
the radial complexity allowed for a data set of size $N$ that provides good generalization
characteristics is accomplished by using cross-validation on smaller data sets, slowly adding
more data and each time using cross-validation; this allows us to see how the radial complexity
that achieves good generalization grows as the size of the data set increases. Using
regression on these radial complexities (letting the radial complexity be a function of $M$, the
number of training points), we can infer the final expected radial complexity of the ANN
that will provide good generalization characteristics when training using the entire data set,
and use hyperspherical backpropagation to train at that radial complexity. This allows the
use of all data to train the ANN and still expect good generalization characteristics when
training is complete.
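The regression step can be sketched as follows. The form of the regression is not stated here, so this example assumes an ordinary least-squares straight line relating the good-generalization radial complexity to the number of training points; the function names and the numbers in the usage lines are hypothetical placeholders rather than measured results.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def extrapolate_radial_complexity(subset_sizes, radial_complexities, full_size):
    """Predict the radial complexity for the full data set from smaller subsets."""
    slope, intercept = fit_line(subset_sizes, radial_complexities)
    return slope * full_size + intercept


if __name__ == "__main__":
    # Hypothetical (size, radial complexity) pairs, for illustration only.
    sizes = [5, 10, 15, 20, 25]
    radii = [4.0, 5.1, 5.9, 7.2, 8.0]
    print(extrapolate_radial_complexity(sizes, radii, 100))
```

A different functional form (for example a fit in the logarithm of the subset size) could be substituted in `fit_line` if the measured growth is clearly not linear.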

3.3.1 Generalization Error. Let the training set be

$D_{train} = \{(\mathbf{x}^{(1)}, \mathbf{t}^{(1)}), \ldots, (\mathbf{x}^{(N)}, \mathbf{t}^{(N)})\},$  (3.37)

and the loss function be $l(D_{train}, \mathbf{w}_\tau)$. The error function on the training data set is

$E_{train}(\mathbf{w}) = \dfrac{1}{N} \sum_{n=1}^{N} l(D_{train}, \mathbf{w}).$  (3.38)

Using any learning rule (e.g., backpropagation), one gets a sequence of weight vectors, $\{\mathbf{w}_\tau\}$,
where $\tau$ is the time step taken by one pass through the training data (one epoch). The loss
function is typically either the least square error

$l(D_{train}, \mathbf{w}_\tau) = \dfrac{1}{2} \sum_{k=1}^{K} \left| t_k^{(n)} - y_k(\mathbf{x}^{(n)}, \mathbf{w}_\tau) \right|^2,$

or the softmax error

$l(D_{train}, \mathbf{w}_\tau) = -\sum_{k=1}^{K} t_k^{(n)} \ln y_k(\mathbf{x}^{(n)}, \mathbf{w}_\tau),$

where $k$ indexes the $K$ network outputs.

The generalization error for a given weight vector, $\mathbf{w}$, is the expected value, $\mathcal{E}[\cdot]$, of
the loss function over all possible future example test data sets,

$E_{gen}(\mathbf{w}) = \mathcal{E}_{D_{test}}[l(D_{test}, \mathbf{w})].$  (3.39)
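The two loss functions and the training-set error can be written out for a single sample as below. This is a minimal sketch, assuming a one-of-K target vector, a factor of one half on the squared error, and softmax outputs for the softmax error; the function names are illustrative.

```python
import math


def sum_squared_loss(targets, outputs):
    """Least square error for one sample: 0.5 * sum_k (t_k - y_k)^2."""
    return 0.5 * sum((t - y) ** 2 for t, y in zip(targets, outputs))


def softmax(activations):
    """Normalized exponentials; the maximum is subtracted for numerical stability."""
    peak = max(activations)
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(targets, outputs):
    """Softmax (cross-entropy) error for one sample: -sum_k t_k * ln(y_k)."""
    return -sum(t * math.log(y) for t, y in zip(targets, outputs) if t > 0)


def training_error(per_sample_losses):
    """Average of the per-sample losses over the training set."""
    return sum(per_sample_losses) / len(per_sample_losses)


if __name__ == "__main__":
    target = [1.0, 0.0, 0.0]
    output = softmax([2.0, 0.5, -1.0])
    print(sum_squared_loss(target, output), softmax_loss(target, output))
```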

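Optimally, the trained network will yield the minimum generalization error,

$E^{*}_{gen} = \min_{\mathbf{w}} [E_{gen}(\mathbf{w})],$

and is therefore defined as

$E^{*}_{gen} = \min_{\mathbf{w}} \left\{ \mathcal{E}_{D_{test}}[l(D_{test}, \mathbf{w})] \right\}.$

Given a cross-validation data set,

$D_{cv} = \{(\mathbf{x}^{(1)}, \mathbf{t}^{(1)}), \ldots, (\mathbf{x}^{(C)}, \mathbf{t}^{(C)})\},$  (3.40)

where $C$ denotes the number of cross-validation samples, the error on the cross-validation set is

$E_{cv}(\mathbf{w}) = \dfrac{1}{C} \sum_{n=1}^{C} l(D_{cv}, \mathbf{w}).$

The generalization error can be approximated as

$E_{gen}(\mathbf{w}_\tau) \approx E_{cv}(\mathbf{w}_\tau),$

where $\tau$ is the time step, or epoch. If cross-validational early stopping is used, then the
optimal generalization error can be approximated as

$E^{*}_{gen} \approx \min_{\tau} [E_{cv}(\mathbf{w}_\tau)].$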
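Cross-validational early stopping amounts to picking the epoch with the smallest validation error and reading off the weight vector (and its magnitude) at that epoch. The following is a minimal sketch, assuming the weight vector and the validation error have been recorded once per epoch; the function name and the recorded-history representation are assumptions made for illustration.

```python
import math


def early_stopping_point(weight_history, cv_errors):
    """Return (epoch, validation error, weight-vector magnitude) at the minimum validation error."""
    best_epoch = min(range(len(cv_errors)), key=lambda epoch: cv_errors[epoch])
    best_weights = weight_history[best_epoch]
    radius = math.sqrt(sum(w * w for w in best_weights))
    return best_epoch, cv_errors[best_epoch], radius


if __name__ == "__main__":
    # Three recorded epochs of a two-weight model, for illustration only.
    history = [[0.1, 0.0], [3.0, 4.0], [6.0, 8.0]]
    errors = [0.9, 0.4, 0.6]
    print(early_stopping_point(history, errors))
```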

The ability of the network to form complex discriminant boundaries is quantified using the
radial complexity, $\rho$; therefore we are trying to empirically establish the relationship between
the generalization error and the radial complexity. Given

$E_{gen}(\mathbf{w}) = f\left(\sqrt{w_1^2 + w_2^2 + \ldots + w_W^2}\right) = E_{gen}(\|\mathbf{w}\|),$

then the generalization error can be expressed as a function of $\rho$ such that $\rho = \|\mathbf{w}_{\tau^*}\|$, where
$\tau^*$ is the integer epoch at which training is stopped. The approximate generalization error can then
be represented as

$E_{gen}(\|\mathbf{w}_{\tau^*}\|) \approx E_{cv}(\mathbf{w}_{\tau^*}),$

therefore

$E_{gen}(\rho) \approx E_{cv}(\mathbf{w}_{\tau^*}).$  (3.41)


If several training and cross-validation sets are available, the expected optimal generalization
error as a function of $\rho$ is then estimated by taking the average minimum error over all
cross-validation sets

$E^{*}_{gen}(\rho) \approx \dfrac{1}{Q} \sum_{q=1}^{Q} E^{(q)}_{cv}(\mathbf{w}_{\tau^*}),$  (3.42)

where $E^{(q)}_{cv}$ is the error on the cross-validation set at the $q$th run, and $Q$ is the number
of separate cross-validation runs. Note that the dependency of the approximate optimal
generalization error on the radial complexity is now explicit since individual solution weight
vectors are not the desired outcome, but instead the relationship between generalization
error and the magnitude of the weight vector is of primary interest.


Define the optimal radial complexity, $\rho^{(t)}_q$, for a data set of size $M_t$ at run $q$ as the one
that corresponds to the weight vector that yields minimum error on the validation set for
that run. The optimal radial complexity, $\rho^{(t)}_{OPT}$, for a classification problem represented by
subsets of size $M_t$ can then be defined as the average of the radial complexities indicated by
cross-validational early stopping on those data sets,

$\rho^{(t)}_{OPT} = \dfrac{1}{Q} \sum_{q=1}^{Q} \rho^{(t)}_q.$  (3.43)

This  average  radial  complexity  is  the  one  that  yields  the  average  minimum  cross-validation 
error  over  the  Q  cross-validation  runs.  The  minimum  radial  complexity  is  not  what  is  sought 
since  the  radial  complexity  indicated  by  each  cross-validation  run  will  be  dependent  on  how 
closely  matched  the  randomly  chosen  training  set  is  to  the  randomly  chosen  validation  set. 
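Averaging over the $Q$ cross-validation runs is a direct calculation once each run has reported its minimum validation error and the radial complexity at which that minimum occurred. This sketch assumes each run is summarized by such a pair; the function name and the example values are hypothetical.

```python
def average_over_runs(run_results):
    """Average (minimum validation error, radial complexity) pairs over Q runs.

    Each entry of run_results is a tuple (min_cv_error, radius_at_min) for one run
    on a randomly drawn training/validation split of the same subset size.
    """
    num_runs = len(run_results)
    mean_error = sum(error for error, _ in run_results) / num_runs
    mean_radius = sum(radius for _, radius in run_results) / num_runs
    return mean_error, mean_radius


if __name__ == "__main__":
    # Hypothetical results from four runs, for illustration only.
    runs = [(0.41, 5.2), (0.38, 4.9), (0.44, 5.6), (0.40, 5.1)]
    print(average_over_runs(runs))
```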


3.3.2 Early Stopping at the Estimated Radial Complexity. As the number of training
data points grows, the expected good-generalization radial complexity grows as well. We
trained an ANN so as to obtain 15 different radial complexities. The radial complexities
grew with increasing number of training data points. Each radial
complexity was an average over 100 runs whose training and validation data were randomly
drawn from the larger data set. Figure 3.10 shows the results of using cross-validation on
data sets with 5 training data points from each class, while Figure 3.11 shows the results of
using cross-validation on data sets with 50 training data points from each class. Once again,
the plots in each figure show the error versus the radial complexity. The radial complexity for
good generalization in each plot is that which leads to minimum error on the validation error
curve. The difference between each plot in a given figure is re-initialization of the weights
and choosing a random data set for training/validation. Looking at these two figures, we
see that the radial complexity yielding good generalization (where the upper validation error
curve is minimum) grows as we increase the number of training data points. Training was
done for 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, and 75 data points from each
class.

Figure 3.10 Cross-Validation result for 5 training data points from each class.

Figure 3.11 Cross-Validation result for 50 training data points from each class.

Figure  3.12  shows  the  average  good  generalization  radial  complexity  calculated  for 
each  size  data  set.  Using  regression  on  this  data,  we  can  estimate  the  radial  complexity  for  a 
large  number  of  training  data  points  which  yields  good  generalization.  The  radial  complexity 
estimated  for  100  data  points  that  yields  good  generalization  is  on  the  order  of  22.  Looking 
at  Figure  3.13,  notice  that  a  radial  complexity  of  22  is  indeed  representative  of  the  region  in 
which  the  error  on  the  validation  set  begins  to  increase. 

We  randomly  generated  a  small  data  set  (100  vectors  from  each  class)  from  the  TESSA 
data  (3  class)  to  be  used  as  the  training  data  set  and  constructed  an  ANN  with  70  hidden 
nodes.  Using  the  technique  described  in  the  previous  section,  we  used  cross-validational 
early  stopping  with  half  the  training  data  as  validation  data  and  half  remaining  as  training 
data.  The  generalization  error  was  then  estimated  as  the  error  obtained  on  the  remaining 
data  (containing  about  10,000  points)  as  per  Equation  3.42.  The  radial  complexity  expected 
to  provide  good  generalization  for  the  full  data  set  was  estimated,  and  the  small  data  set 
was  trained  until  the  radial  complexity  reached  that  value.  The  effects  of  overtraining  can 
be seen in Figure 3.14. Overtraining yields a percentage of correctly classified test points of 51.75%. While cross-validational early stopping improves the correct classification rate to 61.12%, the best classification rate of 63.28% is achieved when using the full data set and stopping training at the estimated good generalization radial complexity, $\rho_N$. These results are summarized in Table 3.44.

Figure 3.12  Growth of the radial complexity providing good generalization as a function of the number of training data points (horizontal axis: number of training data points).

Figure 3.13  Cross-Validation result for 100 training data points from each class (horizontal axis: radial complexity).

Figure 3.14  Using cross-validation, we can see that the error on the validation set grows rapidly even as the training error decreases.


Stopping based on:      Percent correctly classified
$E_{\text{train}}$      51.75%
$E_{\text{cv}}$         61.12%
$\rho_N$                62.75%
                                                        (3.44)


Each  result  is  an  average  classification  rate  over  10  runs.  The  observed  classification  accuracy 
when  using  the  full  data  set  improves  3%  over  that  observed  when  using  a  smaller  part  for  true 
cross-validational  early  stopping,  and  over  14%  over  that  observed  when  stopping  training 
when  a  low  error  on  the  training  data  is  reached. 

We have already shown that standard training of the ANN changes the radial complexity in a consistent way as the weights are trained. Converting the coordinate system in which the weights are represented from Cartesian to hyperspherical (outlined in Appendix B) therefore allows us to lock in the radial complexity (the magnitude of the weight vector) while changing only the remaining parameters of the weight vector.

3.4  Training  at  a  Fixed  Radial  Complexity 

Now  that  we  have  estimated  at  which  radial  complexity  to  train  our  network  to  obtain 
good  generalization,  what  would  the  effect  be  of  restricting  training  to  that  radial  complexity? 
This  analysis  can  be  carried  out  by  using  hyperspherical  methods  such  as  an  Evolution 
Program  or  hyperspherical  backpropagation  which  operate  on  the  angles  only.  Since  we 
have  defined  our  effective  complexity  in  terms  that  are  complementary  to  the  hyperspherical 
coordinate  transformation,  hyperspherical  methods  such  as  these  provide  an  ideal  way  to 
confine  a  weight  vector  to  a  specific  radial  complexity  during  training. 

3.4.1  Genetic Approach.  Using EPs, we can generate a population of weight vectors, convert them to hyperspherical coordinates, use evolution on the angles, and obtain a solution that maintains a constant radial complexity. Figure 3.15 shows the results of using an EP to train the ANN at a given radial complexity to classify the hand-written character set. The error used is the same error used with backprop, but the weight angles are updated using genetic methods rather than backpropagation methods. Each training epoch here is not just a pass through the training data, but a pass that yielded a lower error than the previous iteration. This method demonstrates the tendency of the training error to decrease, but it is very slow compared to hyperspherical backpropagation, which constrains the weight updates to travel in the direction of a lower error.

Figure 3.15  Using an EP to train the ANN at a specific radial complexity.
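
A minimal sketch of evolution on the angles is given below. It is not the EP used to produce Figure 3.15: the mutation-and-selection loop, the function names `angles_to_weights` and `evolve_angles`, the parameters `pop` and `sigma`, and the toy quadratic error are all assumptions chosen for illustration. The point it shows is that, because only the angles are mutated, every candidate stays at the same radial complexity, and a generation is accepted only when it lowers the error.

```python
import numpy as np

def angles_to_weights(theta, rho):
    """w_1 = rho cos(t1), w_2 = rho sin(t1) cos(t2), ...,
    w_W = rho sin(t1) ... sin(t_{W-1})."""
    theta = np.asarray(theta, dtype=float)
    sines = np.concatenate(([1.0], np.cumprod(np.sin(theta))))
    cosines = np.concatenate((np.cos(theta), [1.0]))
    return rho * sines * cosines

def evolve_angles(error_fn, theta0, rho, generations=200, pop=20,
                  sigma=0.1, seed=0):
    """Mutate the angles only; keep an offspring only if it lowers the error."""
    rng = np.random.default_rng(seed)
    best = np.asarray(theta0, dtype=float).copy()
    best_err = error_fn(angles_to_weights(best, rho))
    for _ in range(generations):
        offspring = best + sigma * rng.standard_normal((pop, best.size))
        errs = [error_fn(angles_to_weights(t, rho)) for t in offspring]
        i = int(np.argmin(errs))
        if errs[i] < best_err:          # accept only improving generations
            best, best_err = offspring[i].copy(), errs[i]
    return best, best_err

# Toy error (a stand-in for the ANN training error): distance to a target.
target = np.array([0.0, 3.0, 4.0])

def toy_error(w):
    return float(np.sum((w - target) ** 2))

rho = 5.0
theta_start = np.array([0.3, 0.3])
err_start = toy_error(angles_to_weights(theta_start, rho))
theta_best, err_best = evolve_angles(toy_error, theta_start, rho)
w_best = angles_to_weights(theta_best, rho)
```

For an ANN, `error_fn` would evaluate the training error of the network whose concatenated weight vector is the argument.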

3.4.2  Hyperspherical Backpropagation Approach.  Dunne first suggested transforming the weight vector using polar coordinates (since he was working with only 2 weights) [17] to analyze the behavior of the weights over time. That research was limited to a very low-dimensional weight space; here we allow for any number of weights, since real-world problems usually demand a very large network to reach a solution. Thus, we use the term “hyperspherical” coordinates to denote any radial coordinate system that has more than three Cartesian coordinates as its starting frame of reference. Given that there is some ideal hypershell in which lies the weight vector that provides the best generalization, one hopes that this weight vector radius is what the training methods that limit weight values seek; once we determine this value, we should confine the training of the weight vector to that radial complexity so as not to harm the generalization characteristics of the trained ANN. Here, we refer to any method that updates the weights using a combination of standard backpropagation and hyperspherical coordinates as hyperspherical backpropagation. If we desire to update the angles directly, then we need to know the effects of changing those angles in hyperspherical coordinates, i.e., what is the change in the error due to a change in a given weight vector angle?

For hyperspherical backpropagation, we have

$$\theta^{\tau+1} = \theta^{\tau} - \eta \frac{\partial E_D}{\partial \theta}. \tag{3.45}$$

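For notational convenience, we will arrange the angles in the same manner that we arrange the weights. Let the weights be expressed as

$$W1 = \begin{bmatrix} W1^{1} & W1^{2} & \cdots & W1^{R} \end{bmatrix}, \qquad W2 = \begin{bmatrix} W2^{1} & W2^{2} & \cdots & W2^{S1} \end{bmatrix},$$

where

$$W1^{i} = \begin{bmatrix} W1_{1,i} \\ W1_{2,i} \\ \vdots \\ W1_{S1,i} \end{bmatrix}, \qquad i = 1, 2, \ldots, R,$$

and

$$W2^{j} = \begin{bmatrix} W2_{1,j} \\ W2_{2,j} \\ \vdots \\ W2_{K,j} \end{bmatrix}.$$

Let a concatenated (column) weight vector, $\mathbf{w}$, be

$$\mathbf{w} = \left[ (W1^{1})^T \; (W1^{2})^T \cdots (W1^{R})^T \; (B1)^T \; (W2^{1})^T \; (W2^{2})^T \cdots (W2^{S1})^T \; (B2)^T \right]^T = [w_1 \; w_2 \; \ldots \; w_W]^T.$$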


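Now the coordinate transformation from Appendix B can be used to yield a vector of angles and a magnitude such that

$$\mathbf{w} = [\theta_1, \theta_2, \ldots, \theta_{W-1}, \rho]^T = \left[ (\theta^{W1^{1}})^T \; (\theta^{W1^{2}})^T \cdots (\theta^{W1^{R}})^T \; (\theta^{B1})^T \; (\theta^{W2^{1}})^T \; (\theta^{W2^{2}})^T \cdots (\theta^{W2^{S1}})^T \; (\theta^{B2})^T \right]^T,$$

where

$$\theta^{W1} = \begin{bmatrix} \theta^{W1^{1}} & \theta^{W1^{2}} & \cdots & \theta^{W1^{R}} \end{bmatrix}, \qquad \theta^{W2} = \begin{bmatrix} \theta^{W2^{1}} & \theta^{W2^{2}} & \cdots & \theta^{W2^{S1}} \end{bmatrix}.$$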
Accordingly, we assign the angles to a given weight as

$$W1_{j,i} = \rho \sin\theta_1 \sin\theta_2 \cdots \sin\theta^{W1}_{j,i-1} \cos\theta^{W1}_{j,i}, \tag{3.46}$$

where the subscript $j, i-1$ simply implies the angle preceding the angle $\theta^{W1}_{j,i}$ in the $W$-dimensional weight space. Similarly,

$$B1_{j} = \rho \sin\theta_1 \sin\theta_2 \cdots \sin\theta^{B1}_{j-1} \cos\theta^{B1}_{j}, \tag{3.47}$$

$$W2_{k,j} = \rho \sin\theta_1 \sin\theta_2 \cdots \sin\theta^{W2}_{k,j-1} \cos\theta^{W2}_{k,j}, \tag{3.48}$$

and

$$B2_{k} = \rho \sin\theta_1 \sin\theta_2 \cdots \sin\theta^{B2}_{k-1} \cos\theta^{B2}_{k}. \tag{3.49}$$

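The angle updates are derived in Appendix C as

$$\theta^{W2}_{kj}(\tau+1) = \theta^{W2}_{kj}(\tau) - \eta \sum_{n=1}^{N} \left( y_k^{(n)} - t_k^{(n)} \right) \Big( -z_j^{(n)} W2_{kj} \tan\theta^{W2}_{kj}(\tau) + z_{j+1}^{(n)} W2_{k(j+1)} \cot\theta^{W2}_{kj}(\tau) + \cdots + z_{S1}^{(n)} W2_{k\,S1} \cot\theta^{W2}_{kj}(\tau) + B2_k \cot\theta^{W2}_{kj}(\tau) \Big),$$

$$\theta^{B2}_{k}(\tau+1) = \theta^{B2}_{k}(\tau) - \eta \sum_{n=1}^{N} \left( y_k^{(n)} - t_k^{(n)} \right) \left( -B2_k \tan\theta^{B2}_{k}(\tau) \right), \tag{3.50}$$

$$\theta^{W1}_{ji}(\tau+1) = \theta^{W1}_{ji}(\tau) - \eta \sum_{n=1}^{N} \sum_{k=1}^{K} \left( y_k^{(n)} - t_k^{(n)} \right) W2_{kj}\, z_j^{(n)} \left( 1 - z_j^{(n)} \right) \times \Big( x_i^{(n)} \left( -W1_{ji} \tan\theta^{W1}_{ji}(\tau) \right) + x_{i+1}^{(n)} W1_{j(i+1)} \cot\theta^{W1}_{ji}(\tau) + \cdots + x_R^{(n)} W1_{jR} \cot\theta^{W1}_{ji}(\tau) + B1_j \cot\theta^{W1}_{ji}(\tau) \Big),$$

and

$$\theta^{B1}_{j}(\tau+1) = \theta^{B1}_{j}(\tau) - \eta \sum_{n=1}^{N} \sum_{k=1}^{K} \left( y_k^{(n)} - t_k^{(n)} \right) W2_{kj}\, z_j^{(n)} \left( 1 - z_j^{(n)} \right) \left( -B1_j \tan\theta^{B1}_{j}(\tau) \right), \tag{3.51}$$

where $\eta$ is the step size coefficient, $\tau$ is the time index (epoch), and $z_j$ is the output of hidden node $j$.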

For the purpose of maintaining a constant radial complexity, another method of performing hyperspherical backpropagation is to update the weights in Cartesian coordinates using any standard backpropagation technique, convert the new weights to hyperspherical coordinates, reset the magnitude parameter to its previous value, and then convert back to Cartesian coordinates. These two methods are theoretically identical (although limitations must be placed on the angles, since they are confined to certain regions), and there is no appreciable difference in implementation time since both must convert the weights to hyperspherical coordinates after each weight evaluation and update.
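
The second method can be sketched compactly. Because the angles depend only on the direction of the weight vector, resetting the magnitude parameter and converting back to Cartesian coordinates amounts to rescaling the updated vector to the previous norm, so the sketch below skips the explicit round trip through the angles. The function names and the toy quadratic error are hypothetical stand-ins, not the implementation used in this work.

```python
import numpy as np

def project_to_radius(w, rho):
    """Rescale w so that its Euclidean norm equals rho (direction unchanged)."""
    norm = np.linalg.norm(w)
    if norm == 0.0:
        raise ValueError("the zero vector has no direction to preserve")
    return w * (rho / norm)

def constrained_step(w, grad, eta, rho):
    """One Cartesian gradient step, then reset the magnitude to rho."""
    return project_to_radius(w - eta * grad, rho)

# Toy error E(w) = 0.5 * ||w - target||^2, standing in for the ANN error;
# its gradient with respect to w is simply (w - target).
target = np.array([3.0, 4.0, 0.0])
rho = 2.0
w = project_to_radius(np.array([1.0, -1.0, 1.0]), rho)
start_error = 0.5 * np.sum((w - target) ** 2)
for _ in range(200):
    w = constrained_step(w, w - target, eta=0.1, rho=rho)
final_error = 0.5 * np.sum((w - target) ** 2)
```

The error decreases while the norm of `w` stays at `rho` throughout, which is the behavior the constant radial complexity constraint is meant to enforce.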

Here, we use hyperspherical backpropagation to find a solution weight vector at a constant radial complexity. A comparison of standard backpropagation versus hyperspherical backpropagation when classifying the hand-written OCR data set is presented in Figure 3.16. Notice that cross-validational early stopping would cause the ANN to stop training at about 100 epochs, and without hyperspherical backpropagation to lock the radial complexity in place, the algorithm overtrains thereafter and begins to suffer from an increase in the cross-validation error, which indicates less than optimal generalization. With hyperspherical backpropagation, the error on the validation set not only does not increase, it continues to decrease. This is a consequence of limiting the complexity while continuing the training: since the corners of any two intersecting decision sigmoids can only be so sharp, the weight updates must lower the error by shifting the global positions of the discriminant boundaries rather than going after outliers.

Figure 3.16  This figure demonstrates how using standard backprop to train an ANN on the hand-written character set compares to using Hyperspherical Backprop to train the ANN at a specific radial complexity. Notice that without hyperspherical backpropagation, the algorithm overtrains and begins to suffer from an increase in the validation set error.

3.5  Summary 

The determination of the initial radial complexity of an ANN based on the prior distribution used to generate our initial weight vector was discussed. This quantity determines the starting point of the ANN training algorithm, and should be of primary concern when initializing the weights. Research in the past has initialized the weights without regard for this quantity, simply using the same initial distribution for each weight regardless of the number of hidden nodes. As demonstrated, though, the initial radial complexity grows with the number of weights in the ANN, so the parameters of the distribution (bounds or variance) from which each weight is drawn should change as the number of weights is changed.
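
As an illustration of this growth, the sketch below assumes each of the $W$ weights is drawn independently from a zero-mean normal distribution with standard deviation $\sigma$. The norm then follows a scaled chi distribution, whose mean is $\sigma \sqrt{2}\, \Gamma((W+1)/2) / \Gamma(W/2)$. This is the standard closed form and may be written differently from the expression derived earlier in this work; the function names are hypothetical.

```python
import math

def expected_norm_normal(num_weights, sigma):
    """Expected magnitude of a vector of num_weights i.i.d. N(0, sigma^2) weights
    (mean of a scaled chi distribution), computed via log-gamma for stability."""
    log_ratio = math.lgamma((num_weights + 1) / 2) - math.lgamma(num_weights / 2)
    return sigma * math.sqrt(2.0) * math.exp(log_ratio)

def sigma_for_target_norm(num_weights, target_rho):
    """Standard deviation that gives an expected initial magnitude of target_rho."""
    return target_rho / expected_norm_normal(num_weights, 1.0)

# With a fixed sigma the expected magnitude grows with the number of weights.
small_net = expected_norm_normal(100, 1.0)
large_net = expected_norm_normal(400, 1.0)
# Rescaling sigma keeps the expected magnitude fixed as the network grows.
sigma_400 = sigma_for_target_norm(400, 5.0)
```

For large $W$ the expected magnitude is close to $\sigma \sqrt{W}$, so quadrupling the number of weights roughly doubles it unless $\sigma$ is reduced accordingly.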

By examining the behavior of the radial complexity during network training, we cast doubt on the practice of re-initializing an ANN with weights drawn from an identical distribution as previous initializations. The behavior of the radial complexity is a function of the training, the error function, and the method of training, but given these quantities, the behavior is consistent and not a function of the random values to which the weights are initialized (although it is a function of the distribution from which the weights are initialized).

Knowing the initial radial complexity and how this radial complexity behaves as the training set grows, we then showed that the radial complexity of an ANN that yields the best generalization for the full training data set can be estimated by using cross-validational early stopping on smaller data sets and then using regression on the resulting radial complexities. This gives the radial complexity allowed by training the ANN with all available data, thereby allowing the data to constrain the decision boundaries as much as possible. With this development of radial complexity estimation, we now have a method of training an ANN with standard backpropagation techniques and yet confining the magnitude of the weight vector to a desired radial complexity so as to maintain good generalization characteristics. As an example, we showed how this method provided a better estimated general classification accuracy than true early stopping. This technique is quite useful since the generalization characteristics of the ANN are the primary concern of the end-user.

Finally, we showed how hyperspherical backpropagation can lead to decreased validation error during training of the ANN. By limiting the radial complexity to an estimated magnitude, here estimated using cross-validational early stopping, we have forced the ANN to lower the error by shifting the location of the overall discriminant boundaries rather than overtraining on the outliers.
IV.  Conclusions and Recommendations

4.1  Conclusions

Here, we reviewed how the primary consideration when training an ANN is the ability of the trained network to generalize well [7, 59]. Methods such as cross-validational early stopping and regularization attempt to find a solution at an effective complexity yielding good generalization (since the effective complexity of the network determines its generalization ability [14]). In light of the research done by Bartlett [4], the effective complexity was quantified as the magnitude of the weight vector (the radius of the hypershell defined by the weight vector), here referred to as the radial complexity.

The expected value of initial radial complexity of the ANN was shown to be an increasing function of the number of weights (based on the distribution from which each individual weight is drawn). This expected initial radial complexity is important regardless of the methods used to train the ANN since the initial radial complexity needs to be set appropriately so as to assure growth or decay into a radial complexity that yields good generalization.

The radial complexity was seen to behave in a consistent manner during training from run to run, even when the weights were re-initialized to different starting values. This behavior was shown to be independent of the data set and in fact was independent of the type of training (provided the weights were being guided toward an area of lower perceived error over the training set).

Radial complexity estimation for early stopping was shown to lead to superior generalization when used to train an ANN to a desired radial complexity, and hyperspherical backpropagation was seen to consistently decrease the validation error during training. No previous technique has attempted to obtain a solution that would remain at a specific complexity calculated to provide improved posterior probabilities as measured by the validation error.
4.2  Recommendations for Future Research

Global minimum search techniques, questioned by Lawrence [33] for standard backpropagation, can be used with hyperspherical backpropagation with the constraint that the weight vector remains at a specific radial complexity. This hypershell global minimum (minimum obtainable training error at a given radial complexity) would yield the lowest error in that hypershell, and yet maximize the generalization capability of the ANN.

All methods used to speed up standard backpropagation, such as momentum and instantaneous backpropagation [12], are usable with radial complexity estimation and hyperspherical backpropagation, so there is immediately a plethora of algorithm tweaks to optimize training speed. Ideally, a number of solutions should be obtained at a given radial complexity to form a committee of networks, which can also help improve generalization [7].

This research briefly mentions the relationship between the regularization coefficient, $\alpha$, and the radial complexity, $\rho$, when using Bayesian backpropagation to train an ANN. A logical next step would be to use the technique of radial complexity estimation to determine the optimal $\alpha$ to use during Bayesian backpropagation to provide a result that is based on the expected generalization ability of the final weight vector.
Appendix A.  Weight Update Formula


A.1  Standard Batch Backpropagation

To update each weight, we use the equation

$$w^{\tau+1} = w^{\tau} - \eta \frac{\partial E_D(\mathbf{w})}{\partial w}, \qquad \tau = 1, 2, \ldots, \tag{A.1}$$

where $\tau$ is the time step, $\eta$ is the step size, and $w$ is an individual weight. Realizing that


$$E_D = \sum_{n=1}^{N} E_D^{(n)},$$

where, for softmax error,

$$E_D^{(n)} = -\sum_{k=1}^{K} t_k^{(n)} \ln y_k^{(n)},$$

or, for sum-squared error,

$$E_D^{(n)} = \frac{1}{2} \sum_{k=1}^{K} \left( y_k^{(n)} - t_k^{(n)} \right)^2,$$

we can carry out the analysis as follows.


A.1.1  Second Layer Weight Update.  First, we will update the second layer weights. Using the chain rule, we see that

$$\frac{\partial E_D^{(n)}}{\partial w_{kj}} = \frac{\partial E_D^{(n)}}{\partial a_k^{(n)}} \frac{\partial a_k^{(n)}}{\partial w_{kj}}.$$

From Bishop [7], we know that

whether we use softmax or sum-squared error. Therefore

And
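A.1.2  First Layer Weight Update.  The first layer weight update is carried out by setting

$$\frac{\partial E_D^{(n)}}{\partial w_{ji}} = \sum_{k=1}^{K} \frac{\partial E_D^{(n)}}{\partial a_k^{(n)}} \frac{\partial a_k^{(n)}}{\partial z_j^{(n)}} \frac{\partial z_j^{(n)}}{\partial a_j^{(n)}} \frac{\partial a_j^{(n)}}{\partial w_{ji}}. \tag{A.6}$$

We already know $\partial E_D^{(n)} / \partial a_k^{(n)}$, so now we determine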
The step-size parameter $\eta$ ($\eta > 0$) has been the subject of much research and is usually chosen to speed the training process. If $\eta$ is chosen to be constant for each weight update (as is sometimes the case), we note that the step size is a linearly increasing function of the training set size, $N$, because the batch gradient is a sum over all $N$ training points. To eliminate this dependency of the step size on $N$, we choose to make

$$\eta = \frac{c}{N}, \tag{A.9}$$

where $c$ is a constant.
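
The effect of this choice can be checked with a short sketch. The function name `batch_step` and the per-sample gradients below are hypothetical; the sketch only shows that with $\eta = c/N$ the batch update no longer grows when the same data are presented in a larger training set.

```python
import numpy as np

def batch_step(w, per_sample_grads, c):
    """Batch update with step size eta = c / N, where N is the number of
    training points and the batch gradient is the sum of per-sample gradients."""
    grads = np.asarray(per_sample_grads, dtype=float)
    n = grads.shape[0]
    eta = c / n
    return w - eta * grads.sum(axis=0)

w0 = np.zeros(2)
grads = np.array([[1.0, 2.0],
                  [3.0, 4.0]])          # placeholder per-sample gradients
step_small = batch_step(w0, grads, c=0.5)
# Doubling the training set by repeating it leaves the update unchanged.
step_large = batch_step(w0, np.vstack([grads, grads]), c=0.5)
```

With a constant $\eta$ instead, the second update would be twice as large as the first.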


A.2  Weight Updates with Regularization

With our regularization, we see that, in a generic case,

same as in Appendix A. The only change is the addition of $\partial E_W(\mathbf{w}) / \partial w$.
Appendix B.  Hyperspherical Coordinate Transformation


Given a weight vector

$$\mathbf{w} = [w_1, w_2, \ldots, w_W],$$

we wish to generate a hyperspherical representation such that

$$\mathbf{w} = [\theta_1, \theta_2, \ldots, \theta_{W-1}, \rho],$$

where

$$\begin{aligned}
w_1 &= \rho \cos\theta_1, \\
w_2 &= \rho \sin\theta_1 \cos\theta_2, \\
w_3 &= \rho \sin\theta_1 \sin\theta_2 \cos\theta_3, \\
&\;\;\vdots \\
w_{W-1} &= \rho \sin\theta_1 \sin\theta_2 \sin\theta_3 \cdots \sin\theta_{W-2} \cos\theta_{W-1}, \\
w_W &= \rho \sin\theta_1 \sin\theta_2 \sin\theta_3 \cdots \sin\theta_{W-2} \sin\theta_{W-1}.
\end{aligned} \tag{B.4}$$

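Generating the angles is a matter of bookkeeping. All angles (except $\theta_{W-1}$) are in the interval $[0, \pi)$, while $\theta_{W-1}$ is in the interval $[-\pi, \pi]$. The angles $\theta_1, \ldots, \theta_{W-2}$ are necessary to project $\mathbf{w}$ into the next set of dimensions while $\theta_{W-1}$ is confined to two dimensions. The radius and angles are then found as follows:

$$\rho = \sqrt{w_1^2 + w_2^2 + \cdots + w_W^2},$$

$$\theta_1 = \arccos\left[ \frac{w_1}{\rho} \right],$$

$$\theta_2 = \begin{cases} \arccos\left[ \dfrac{w_2}{\rho \sin\theta_1} \right] & \theta_1 \neq 0 \\ 0 & \theta_1 = 0 \end{cases} \tag{B.5}$$

$$\theta_3 = \begin{cases} \arccos\left[ \dfrac{w_3}{\rho \sin\theta_1 \sin\theta_2} \right] & \theta_2 \neq 0 \\ 0 & \theta_2 = 0 \end{cases} \tag{B.6}$$

$$\theta_{W-2} = \begin{cases} \arccos\left[ \dfrac{w_{W-2}}{\rho \sin\theta_1 \sin\theta_2 \cdots \sin\theta_{W-3}} \right] & \theta_{W-3} \neq 0 \\ 0 & \theta_{W-3} = 0 \end{cases} \tag{B.7}$$

$$\theta_{W-1} = \begin{cases} -\arccos\left[ \dfrac{w_{W-1}}{\rho \sin\theta_1 \sin\theta_2 \cdots \sin\theta_{W-2}} \right] & w_W < 0,\; \theta_{W-2} \neq 0 \\ +\arccos\left[ \dfrac{w_{W-1}}{\rho \sin\theta_1 \sin\theta_2 \cdots \sin\theta_{W-2}} \right] & w_W \geq 0,\; \theta_{W-2} \neq 0 \\ 0 & \theta_{W-2} = 0 \end{cases} \tag{B.8}$$

where arccos denotes the principal inverse cosine function. With these relationships, we can freely change the hyperspherical parameters of a given weight vector to see how those changes affect the error.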
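
The transformation in both directions can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in this work: the function names are hypothetical, the ratio passed to arccos is clipped to $[-1, 1]$ to absorb rounding error, and once the product of sines reaches exactly zero the remaining angles are left at zero, mirroring the zero branches of the piecewise definitions.

```python
import numpy as np

def to_hyperspherical(w):
    """Return (theta, rho) for a weight vector w with at least two elements."""
    w = np.asarray(w, dtype=float)
    if w.size < 2:
        raise ValueError("need at least two weights")
    rho = float(np.linalg.norm(w))
    theta = np.zeros(w.size - 1)
    sin_prod = 1.0                      # running product sin(t1)...sin(t_{i-1})
    for i in range(w.size - 1):
        denom = rho * sin_prod
        if denom == 0.0:                # degenerate case: leave later angles at 0
            break
        theta[i] = np.arccos(np.clip(w[i] / denom, -1.0, 1.0))
        sin_prod *= np.sin(theta[i])
    if w[-1] < 0:                       # the last angle carries the sign of w_W
        theta[-1] = -theta[-1]
    return theta, rho

def to_cartesian(theta, rho):
    """Inverse map: w_i = rho sin(t1)...sin(t_{i-1}) cos(t_i), with the last
    weight using sin(t_{W-1}) in place of a cosine."""
    theta = np.asarray(theta, dtype=float)
    sines = np.concatenate(([1.0], np.cumprod(np.sin(theta))))
    cosines = np.concatenate((np.cos(theta), [1.0]))
    return rho * sines * cosines

w_example = np.array([1.0, -2.0, 3.0, -4.0])
theta_example, rho_example = to_hyperspherical(w_example)
w_roundtrip = to_cartesian(theta_example, rho_example)
# Changing rho while keeping the angles rescales the vector without turning it.
w_rescaled = to_cartesian(theta_example, 2.0 * rho_example)
```

Holding `theta_example` fixed and varying `rho_example` moves the weight vector between hypershells, while varying the angles at fixed `rho_example` moves it within one hypershell.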
Appendix C.  Derivation of Angle Updates for Hyperspherical Backpropagation

For completeness, we derive the angle updates when using hyperspherical backprop.


C.1  Second Layer Angle Update

First, we will update the second layer weights. Once again, remember that

$$E_D = \sum_{n=1}^{N} E_D^{(n)}.$$

Using the chain rule, we see that

From Appendix A, we know that

Therefore

C.2  First Layer Angle Update

Now, the first layer weight update is carried out by setting

We can find in the same way we found

This leads to